realpython · eyrei123 · Nov 19, 2023 · Nov 20, 2023 · Nov 20, 2023 · Dec 3, 2023
diff --git a/README.md b/README.md
@@ -1,37 +1,17 @@
-# Real Python Materials
-
-Bonus materials, exercises, and example projects for Real Python's [Python tutorials](https://realpython.com).
-
-Build Status:
-[![GitHub Actions](https://img.shields.io/github/actions/workflow/status/realpython/materials/linters.yml?branch=master)](https://github.com/realpython/materials/actions)
-
-## Got a Question?
-
-The best way to get support for Real Python courses, articles, and code in this repository is to join one of our [weekly Office Hours calls](https://realpython.com/office-hours/) or to ask your question in the [RP Community Chat](https://realpython.com/community/).
-
-Due to time constraints, we cannot provide 1:1 support via GitHub. See you on Slack or on the next Office Hours call 🙂
-
-## Adding Source Code & Sample Projects to This Repo (RP Contributors)
-
-### Running Code Style Checks
-
-We use [flake8](http://flake8.pycqa.org/en/latest/) and [black](https://black.readthedocs.io/) to ensure a consistent code style for all of our sample code in this repository.
-
-Run the following commands to validate your code against the linters:
-
-```sh
-$ flake8
-$ black --check .
-```
-
-### Running Python Code Formatter
-
-We're using a tool called [black](https://black.readthedocs.io/) on this repo to ensure consistent formatting. On CI it runs in "check" mode to ensure any new files added to the repo follow PEP 8. If you see linter warnings that say something like "would reformat some_file.py" it means that black disagrees with your formatting.
-
-**The easiest way to resolve these errors is to run Black locally on the code and then commit those changes, as explained below.**
-
-To automatically re-format your code to be consistent with our code style guidelines, run [black](https://black.readthedocs.io/) in the repository root folder:
-
-```sh
-$ black .
-```
+# Using Python for Data Analysis
+
+This folder contains completed notebooks and other files used in the Real Python tutorial on [Using Python for Data Analysis](https://realpython.com/using-python-for-data-analysis/). 
+
+None of the files are required to complete the tutorial, but you may find them useful for reference as you work through it.
+
+## Available Files
+
+- `data analysis findings.ipynb` is a Jupyter Notebook containing all the code used in the tutorial.
+- `data analysis results.ipynb` is a Jupyter Notebook containing the final version of the cleansing and analysis code.
+- `james_bond_data.csv` contains the data to be cleansed and analyzed in its original form, in CSV format.
+- `james_bond_data.json` contains the data to be cleansed and analyzed in its original form, in JSON format.
+- `james_bond_data.parquet` contains the data to be cleansed and analyzed in its original form, in Parquet format.
+- `james_bond_data.xlsx` contains the data to be cleansed and analyzed in its original form, in Microsoft Excel format.
+- `james_bond_data_cleansed.csv` contains the cleansed data in its final form.
+
+Although the tutorial can be completed in a range of Python environments, the use of Jupyter Notebook within JupyterLab is highly recommended.
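Review note on the README hunk: the same dataset is listed in four formats, so readers can load whichever one suits their setup. The sketch below is not taken from the tutorial. It writes a tiny made-up frame to CSV and JSON in a temporary folder and reads both back, to illustrate that the matching pandas readers yield the same rows. The column names, the values, and the `orient="records"` JSON layout are all assumptions for illustration only. `pd.read_parquet()` and `pd.read_excel()` follow the same pattern but need an extra engine installed, such as pyarrow or openpyxl.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Placeholder rows standing in for the real dataset (values are made up).
sample = pd.DataFrame(
    {"movie_title": ["Film A", "Film B"], "bond_kills": [4, 9]}
)

with tempfile.TemporaryDirectory() as folder:
    csv_path = Path(folder) / "sample.csv"
    json_path = Path(folder) / "sample.json"

    # Write the same rows in two of the formats listed in the README.
    # The "records" layout is an assumption, not the repo's confirmed layout.
    sample.to_csv(csv_path, index=False)
    sample.to_json(json_path, orient="records")

    # Read both files back with the matching pandas reader.
    from_csv = pd.read_csv(csv_path)
    from_json = pd.read_json(json_path, orient="records")

# Compare plain Python records so dtype details don't affect the check.
same_rows = from_csv.to_dict("records") == from_json.to_dict("records")
print(same_rows)
```

Whichever format is used, calling `.convert_dtypes()` on the result, as the notebook does for the CSV file, gives every variant the same starting point for cleansing.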
diff --git a/data analysis results.ipynb b/data analysis results.ipynb
@@ -0,0 +1,222 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "ade4bd3f-543b-460b-980f-0b41aab2c8b6",
+   "metadata": {},
+   "source": [
+    "# Data Cleansing Code"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a360772e-7829-4c15-9af9-d4596efc7351",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!python -m pip install pandas"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c98c7640-1472-4869-9fdd-f070d665ae1d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "james_bond_data = pd.read_csv(\"james_bond_data.csv\").convert_dtypes()\n",
+    "\n",
+    "data = (\n",
+    "    james_bond_data.rename(columns=new_column_names)\n",
+    "    .combine_first(\n",
+    "        pd.DataFrame(\n",
+    "            {\"imdb_rating\": {10: 7.1}, \"rotten_tomatoes_rating\": {10: 6.8}}\n",
+    "        )\n",
+    "    )\n",
+    "    .assign(\n",
+    "        gross_income_usa=lambda data: (\n",
+    "            data[\"gross_income_usa\"]\n",
+    "            .replace(\"[$,]\", \"\", regex=True)\n",
+    "            .astype(float)\n",
+    "        ),\n",
+    "        gross_income_world=lambda data: (\n",
+    "            data[\"gross_income_world\"]\n",
+    "            .replace(\"[$,]\", \"\", regex=True)\n",
+    "            .astype(float)\n",
+    "        ),\n",
+    "        movie_budget=lambda data: (\n",
+    "            data[\"movie_budget\"].replace(\"[$,]\", \"\", regex=True).astype(float)\n",
+    "            * 1000\n",
+    "        ),\n",
+    "        film_length=lambda data: (\n",
+    "            data[\"film_length\"]\n",
+    "            .str.removesuffix(\"mins\")\n",
+    "            .astype(int)\n",
+    "            .replace(1200, 120)\n",
+    "        ),\n",
+    "        release_date=lambda data: pd.to_datetime(\n",
+    "            data[\"release_date\"], format=\"%B, %Y\"\n",
+    "        ),\n",
+    "        release_year=lambda data: data[\"release_date\"].dt.year,\n",
+    "        bond_actor=lambda data: (\n",
+    "            data[\"bond_actor\"]\n",
+    "            .str.replace(\"Shawn\", \"Sean\")\n",
+    "            .str.replace(\"MOORE\", \"Moore\")\n",
+    "        ),\n",
+    "        car_manufacturer=lambda data: data[\"car_manufacturer\"].str.replace(\n",
+    "            \"Astin\", \"Aston\"\n",
+    "        ),\n",
+    "        martinis_consumed=lambda data: data[\"martinis_consumed\"].replace(\n",
+    "            -6, 6\n",
+    "        ),\n",
+    "    )\n",
+    ").drop_duplicates(ignore_index=True)\n",
+    "\n",
+    "data.to_csv(\"james_bond_data_cleansed.csv\", index=False)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f50918ee-e61f-46b2-b0c2-1ffa2c62bbc0",
+   "metadata": {},
+   "source": [
+    "# Data Analysis Code"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "86817f68-05a0-4235-a1c8-a5d1f6e9141e",
+   "metadata": {},
+   "source": [
+    "## Performing a Regression Analysis"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "bee6d6cb-e418-4c1d-8b75-604b9ab2e63d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!python -m pip install matplotlib scikit-learn"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "669fb9d7-d744-4e6b-899e-a69aebec53ed",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import matplotlib.pyplot as plt\n",
+    "from sklearn.linear_model import LinearRegression\n",
+    "\n",
+    "# Regress the Rotten Tomatoes rating on the IMDb rating.\n",
+    "# The double brackets keep x two-dimensional, as scikit-learn expects.\n",
+    "x = data.loc[:, [\"imdb_rating\"]]\n",
+    "y = data.loc[:, \"rotten_tomatoes_rating\"]\n",
+    "\n",
+    "model = LinearRegression()\n",
+    "model.fit(x, y)\n",
+    "\n",
+    "r_squared = f\"R-Squared: {model.score(x, y):.2f}\"\n",
+    "best_fit = f\"y = {model.coef_[0]:.4f}x{model.intercept_:+.4f}\"\n",
+    "y_pred = model.predict(x)\n",
+    "\n",
+    "fig, ax = plt.subplots()\n",
+    "ax.scatter(x, y)\n",
+    "ax.plot(x, y_pred, color=\"red\")\n",
+    "ax.text(7.25, 5.5, r_squared, fontsize=10)\n",
+    "ax.text(7.25, 7, best_fit, fontsize=10)\n",
+    "ax.set_title(\"Scatter Plot of Ratings\")\n",
+    "ax.set_xlabel(\"Average IMDb Rating\")\n",
+    "ax.set_ylabel(\"Average Rotten Tomatoes Rating\")\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b38df412-c320-49fb-93ae-e253405537a8",
+   "metadata": {},
+   "source": [
+    "## Investigating a Statistical Distribution"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "938e5942-e57f-4e41-99f1-215cfb37d0df",
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "# Bin the film lengths into seven equal-width ranges and count each bin.\n",
+    "length = data[\"film_length\"].value_counts(bins=7).sort_index()\n",
+    "length.plot.bar(\n",
+    "    title=\"Film Length Distribution\",\n",
+    "    xlabel=\"Time Range (mins)\",\n",
+    "    ylabel=\"Count\",\n",
+    ")\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ff4e9955-baf4-48eb-b032-fbf55f439194",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data[\"film_length\"].agg([\"mean\", \"max\", \"min\", \"std\"])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1b14c433-c3a6-4484-bc0a-26825bd1e870",
+   "metadata": {},
+   "source": [
+    "## Finding No Relationship"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2bb83374-347f-4cf6-bc21-8180a003371d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "fig, ax = plt.subplots()\n",
+    "ax.scatter(data[\"imdb_rating\"], data[\"bond_kills\"])\n",
+    "ax.set_title(\"Scatter Plot of Kills vs Ratings\")\n",
+    "ax.set_xlabel(\"Average IMDb Rating\")\n",
+    "ax.set_ylabel(\"Kills by Bond\")\n",
+    "plt.show()"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.0"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
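Review note on the cleansing cell: it calls `rename(columns=new_column_names)`, but `new_column_names` is not defined anywhere in this notebook, so running the cells from top to bottom would likely stop with a `NameError`. The sketch below shows one way the missing mapping could look and how it feeds the currency and date cleanup already used in the cell. The raw header names on the left of the mapping and the sample row are assumptions made up for illustration; check them against `james_bond_data.csv` before relying on them.

```python
import pandas as pd

# Hypothetical raw headers (left) mapped to the names the notebook expects.
# Verify the left-hand names against james_bond_data.csv.
new_column_names = {
    "Release": "release_date",
    "US_Gross": "gross_income_usa",
}

# Made-up stand-in row so the sketch runs without the real file.
raw = pd.DataFrame(
    {
        "Release": ["June, 1962"],
        "US_Gross": ["$1,234,567.00"],
    }
)

cleansed = raw.rename(columns=new_column_names).assign(
    # Strip dollar signs and thousands separators, then convert to float.
    gross_income_usa=lambda df: (
        df["gross_income_usa"].replace("[$,]", "", regex=True).astype(float)
    ),
    # Parse "Month, Year" strings into timestamps.
    release_date=lambda df: pd.to_datetime(
        df["release_date"], format="%B, %Y"
    ),
)

print(cleansed.dtypes)
```

Defining the full mapping in the same cell as the `read_csv()` call would keep the results notebook runnable on its own, without depending on state from the findings notebook.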