Updated Jupyter Notebook #466

Open
wants to merge 4 commits into
base: master
54 changes: 17 additions & 37 deletions README.md
@@ -1,37 +1,17 @@
# Real Python Materials

Bonus materials, exercises, and example projects for Real Python's [Python tutorials](https://realpython.com).

Build Status:
[![GitHub Actions](https://img.shields.io/github/actions/workflow/status/realpython/materials/linters.yml?branch=master)](https://github.com/realpython/materials/actions)

## Got a Question?

The best way to get support for Real Python courses, articles, and code in this repository is to join one of our [weekly Office Hours calls](https://realpython.com/office-hours/) or to ask your question in the [RP Community Chat](https://realpython.com/community/).

Due to time constraints, we cannot provide 1:1 support via GitHub. See you on Slack or on the next Office Hours call 🙂

## Adding Source Code & Sample Projects to This Repo (RP Contributors)

### Running Code Style Checks

We use [flake8](http://flake8.pycqa.org/en/latest/) and [black](https://black.readthedocs.io/) to ensure a consistent code style for all of our sample code in this repository.

Run the following commands to validate your code against the linters:

```sh
$ flake8
$ black --check .
```

### Running Python Code Formatter

We're using a tool called [black](https://black.readthedocs.io/) on this repo to ensure consistent formatting. On CI it runs in "check" mode to ensure any new files added to the repo follow PEP 8. If you see linter warnings that say something like "would reformat some_file.py" it means that black disagrees with your formatting.

**The easiest way to resolve these errors is to run Black locally on the code and then commit those changes, as explained below.**

To automatically re-format your code to be consistent with our code style guidelines, run [black](https://black.readthedocs.io/) in the repository root folder:

```sh
$ black .
```
# Using Python for Data Analysis

This folder contains completed notebooks and other files used in the Real Python tutorial on [Using Python for Data Analysis](https://realpython.com/using-python-for-data-analysis/).

None of the files are required to complete the tutorial. However, you may find them useful for reference as you work through it.

## Available Files

- `data analysis findings.ipynb` is a Jupyter Notebook containing all the code used in the tutorial.
- `data analysis results.ipynb` is a Jupyter Notebook containing the final version of the cleansing and analysis code.
- `james_bond_data.csv` contains the data to be cleansed and analyzed in its original form, in CSV format.
- `james_bond_data.json` contains the data to be cleansed and analyzed in its original form, in JSON format.
- `james_bond_data.parquet` contains the data to be cleansed and analyzed in its original form, in Parquet format.
- `james_bond_data.xlsx` contains the data to be cleansed and analyzed in its original form, in Microsoft Excel format.
- `james_bond_data_cleansed.csv` contains the cleansed data in its final form.
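
The four raw files hold the same data in different formats, so any one of them can serve as the starting point. As a minimal sketch, a pandas reader can be chosen from the file extension. The `load_raw_data` helper and the `READERS` mapping are hypothetical names used for illustration and are not part of the tutorial. Reading the Parquet and Excel files assumes that optional dependencies such as `pyarrow` and `openpyxl` are installed.

```python
from pathlib import Path

import pandas as pd

# Hypothetical mapping from file extension to the pandas reader for that format.
READERS = {
    ".csv": pd.read_csv,
    ".json": pd.read_json,
    ".parquet": pd.read_parquet,  # needs an engine such as pyarrow
    ".xlsx": pd.read_excel,  # needs openpyxl
}


def load_raw_data(path):
    """Load one of the raw data files, choosing the reader by extension."""
    path = Path(path)
    try:
        reader = READERS[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"Unsupported file type: {path.suffix!r}") from None
    # convert_dtypes() mirrors the call used in the results notebook.
    return reader(path).convert_dtypes()
```

For example, `load_raw_data("james_bond_data.csv")` would return the uncleansed data as a DataFrame, assuming the code runs from the folder that holds the files.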

Although the tutorial can be completed in a range of Python environments, the use of Jupyter Notebook within JupyterLab is highly recommended.
222 changes: 222 additions & 0 deletions data analysis results.ipynb
@@ -0,0 +1,222 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "ade4bd3f-543b-460b-980f-0b41aab2c8b6",
"metadata": {},
"source": [
"# Data Cleansing Code"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a360772e-7829-4c15-9af9-d4596efc7351",
"metadata": {},
"outputs": [],
"source": [
"!python -m pip install pandas"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c98c7640-1472-4869-9fdd-f070d665ae1d",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"james_bond_data = pd.read_csv(\"james_bond_data.csv\").convert_dtypes()\n",
"\n",
"# `new_column_names` is not defined in this notebook. It must be a dict that\n",
"# maps the original CSV headers to the snake_case names used below.\n",
"data = (\n",
"    james_bond_data.rename(columns=new_column_names)\n",
"    .combine_first(\n",
"        pd.DataFrame(\n",
"            {\"imdb_rating\": {10: 7.1}, \"rotten_tomatoes_rating\": {10: 6.8}}\n",
"        )\n",
"    )\n",
"    .assign(\n",
"        gross_income_usa=lambda data: (\n",
"            data[\"gross_income_usa\"]\n",
"            .replace(\"[$,]\", \"\", regex=True)\n",
"            .astype(float)\n",
"        ),\n",
"        gross_income_world=lambda data: (\n",
"            data[\"gross_income_world\"]\n",
"            .replace(\"[$,]\", \"\", regex=True)\n",
"            .astype(float)\n",
"        ),\n",
"        movie_budget=lambda data: (\n",
"            data[\"movie_budget\"].replace(\"[$,]\", \"\", regex=True).astype(float)\n",
"            * 1000\n",
"        ),\n",
"        film_length=lambda data: (\n",
"            data[\"film_length\"]\n",
"            .str.removesuffix(\"mins\")\n",
"            .astype(int)\n",
"            .replace(1200, 120)\n",
"        ),\n",
"        release_date=lambda data: pd.to_datetime(\n",
"            data[\"release_date\"], format=\"%B, %Y\"\n",
"        ),\n",
"        release_year=lambda data: data[\"release_date\"].dt.year,\n",
"        bond_actor=lambda data: (\n",
"            data[\"bond_actor\"]\n",
"            .str.replace(\"Shawn\", \"Sean\")\n",
"            .str.replace(\"MOORE\", \"Moore\")\n",
"        ),\n",
"        car_manufacturer=lambda data: data[\"car_manufacturer\"].str.replace(\n",
"            \"Astin\", \"Aston\"\n",
"        ),\n",
"        martinis_consumed=lambda data: data[\"martinis_consumed\"].replace(\n",
"            -6, 6\n",
"        ),\n",
"    )\n",
").drop_duplicates(ignore_index=True)\n",
"\n",
"data.to_csv(\"james_bond_data_cleansed.csv\", index=False)"
]
},
{
"cell_type": "markdown",
"id": "f50918ee-e61f-46b2-b0c2-1ffa2c62bbc0",
"metadata": {},
"source": [
"# Data Analysis Code"
]
},
{
"cell_type": "markdown",
"id": "86817f68-05a0-4235-a1c8-a5d1f6e9141e",
"metadata": {},
"source": [
"## Performing a Regression Analysis"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bee6d6cb-e418-4c1d-8b75-604b9ab2e63d",
"metadata": {},
"outputs": [],
"source": [
"!python -m pip install matplotlib scikit-learn"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "669fb9d7-d744-4e6b-899e-a69aebec53ed",
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"from sklearn.linear_model import LinearRegression\n",
"\n",
"x = data.loc[:, [\"imdb_rating\"]]\n",
"y = data.loc[:, \"rotten_tomatoes_rating\"]\n",
"\n",
"model = LinearRegression()\n",
"model.fit(x, y)\n",
"\n",
"r_squared = f\"R-Squared: {model.score(x, y):.2f}\"\n",
"best_fit = f\"y = {model.coef_[0]:.4f}x{model.intercept_:+.4f}\"\n",
"y_pred = model.predict(x)\n",
"\n",
"fig, ax = plt.subplots()\n",
"ax.scatter(x, y)\n",
"ax.plot(x, y_pred, color=\"red\")\n",
"ax.text(7.25, 5.5, r_squared, fontsize=10)\n",
"ax.text(7.25, 7, best_fit, fontsize=10)\n",
"ax.set_title(\"Scatter Plot of Ratings\")\n",
"ax.set_xlabel(\"Average IMDb Rating\")\n",
"ax.set_ylabel(\"Average Rotten Tomatoes Rating\")\n",
"# fig.show()"
]
},
{
"cell_type": "markdown",
"id": "b38df412-c320-49fb-93ae-e253405537a8",
"metadata": {},
"source": [
"## Investigating a Statistical Distribution"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "938e5942-e57f-4e41-99f1-215cfb37d0df",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# fig, ax = plt.subplots()\n",
"length = data[\"film_length\"].value_counts(bins=7).sort_index()\n",
"length.plot.bar(\n",
" title=\"Film Length Distribution\",\n",
" xlabel=\"Time Range (mins)\",\n",
" ylabel=\"Count\",\n",
")\n",
"# fig.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ff4e9955-baf4-48eb-b032-fbf55f439194",
"metadata": {},
"outputs": [],
"source": [
"data[\"film_length\"].agg([\"mean\", \"max\", \"min\", \"std\"])"
]
},
{
"cell_type": "markdown",
"id": "1b14c433-c3a6-4484-bc0a-26825bd1e870",
"metadata": {},
"source": [
"## Finding No Relationship"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2bb83374-347f-4cf6-bc21-8180a003371d",
"metadata": {},
"outputs": [],
"source": [
"fig, ax = plt.subplots()\n",
"ax.scatter(data[\"imdb_rating\"], data[\"bond_kills\"])\n",
"ax.set_title(\"Scatter Plot of Kills vs Ratings\")\n",
"ax.set_xlabel(\"Average IMDb Rating\")\n",
"ax.set_ylabel(\"Kills by Bond\")\n",
"# fig.show()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
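
The cleansing cell in the notebook above strips `$` and `,` from the currency columns before converting them to floats. The following is a minimal, self-contained sketch of that one step. The sample values are made up for illustration and do not come from the data set.

```python
import pandas as pd

# Made-up sample values, purely illustrative.
sample = pd.DataFrame({"gross_income_usa": ["$1,234,567.00", "$89,000.50"]})

cleaned = sample.assign(
    gross_income_usa=lambda df: (
        df["gross_income_usa"]
        # Inside a character class, "$" and "," are matched literally.
        .replace("[$,]", "", regex=True)
        .astype(float)
    )
)

print(cleaned["gross_income_usa"].tolist())
```

Using `.assign()` with a lambda keeps the step chainable, which is how the notebook combines it with the other column fixes in a single expression.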