# **Lab 1: Part I - Review of Data Science**
---
### **Description**
This lab reviews exploratory data analysis (EDA) with Python's pandas library and data visualization with the matplotlib library. Throughout the notebook, you'll review how to load and manipulate datasets with pandas commands and use matplotlib to create visualizations that help uncover patterns and trends in the data.

<br>

### **Lab Structure**
**Part 1**: [Exploratory Data Analysis Review](#ip1)

  >  **Part 1.1**: [Basic Commands](#ip1.1)

  >  **Part 1.2**: [Further Exploration](#ip1.2)

**Part 2**: [Data Visualization Review](#ip2)

  >  **Part 2.1**: [Scatter Plots](#ip2.1)

  >  **Part 2.2**: [Line Plots](#ip2.2)

  >  **Part 2.3**: [Bar Plots](#ip2.3)

**Part 3**: [[OPTIONAL] Improving Visualizations](#ip3)

  >  **Part 3.1**: [Improving Scatter Plots](#ip3.1)

  >  **Part 3.2**: [Improving Line Plots](#ip3.2)

  >  **Part 3.3**: [Improving Bar Plots](#ip3.3)

  >  **Part 3.4**: [Enhancing Plot Aesthetics](#ip3.4)



<br>

### **Learning Objectives**
 By the end of this lab, we will:
* Understand basic pandas commands for EDA.

* Understand basic matplotlib commands for Data Visualization.


<br>


### **Resources**
* [EDA with pandas Cheat Sheet](https://docs.google.com/document/d/1FFoqw45P-kuoq912ARP4qfdGeLTqoq73_qjZThPp2_8/edit?usp=drive_link)

* [Data Visualization with matplotlib Cheat Sheet](https://docs.google.com/document/d/1YlUp6ll81qOyDpU1OWzE-SPxQ3hnF5C9ukLRL_6PYKE/edit?usp=drive_link)


<br>

**Before starting, run the code below to import all necessary functions and libraries.**


In [None]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

<a name="ip1"></a>

---
## **Part 1: Exploratory Data Analysis Review**
---




<a name="ip1.1"></a>

---
### **Part 1.1: Basic Commands**
---


**Run the code cell below to create the DataFrame.**

In [None]:
df = pd.DataFrame({'U.S. State': ['California', 'Florida', 'Indiana', 'Texas', 'Pennsylvania'],
        'Population (in millions)': [38, 21, 6.5, 28, 13],
        'Capital': ['Sacramento', 'Tallahassee', 'Indianapolis', 'Austin', 'Harrisburg'],
        'GDP ($ in billions)': [3700, 1070, 352, 1876, 726]})

#### **Problem #1.1.1**

**Together**, let's inspect what `.head()` tells us about this DataFrame.

#### **Problem #1.1.2**

**Together**, let's determine what datatype `Population (in millions)` is.

#### **Problem #1.1.3**

**Together**, let's print all of the unique values for `GDP ($ in billions)`.

---

#### **Now it's your turn! Try Problems #1.1.4 - 1.1.7 on your own.**

---

#### **Problem #1.1.4**

**Independently**, determine the column names in the dataset.

#### **Problem #1.1.5**

**Independently**, determine the highest `GDP ($ in billions)` in the dataset.

#### **Problem #1.1.6**

**Independently**, determine which states are included in this dataset.

#### **Problem #1.1.7**

**Independently**, determine the range of `GDP ($ in billions)` values among the states.

---

<center>

#### **Wait for Your Instructor to Continue**

---

<a name="ip1.2"></a>

---
### **Part 1.2: Further Exploration**
---



#### **Problem #1.2.1**

**Independently**, determine the average `Population (in millions)` size among the U.S. states in the dataset.

#### **Problem #1.2.2**

**Independently**, explore rows 4 and 5. What are the U.S. States listed?

#### **Problem #1.2.3**

**Independently**, determine the total `Population (in millions)` across all states.

#### **Problem #1.2.4**

**Independently**, determine the `Population (in millions)` for the 3rd state in the dataset.

#### **Problem #1.2.5**

**Independently**, determine how many states have a population greater than 20 million.

#### **Problem #1.2.6**

**Independently**, explore the last row in the dataset.

#### **[Challenge Question] Problem #1.2.7**

**Independently**, determine the average `GDP per capita` for the states.

**HINT:** Divide `GDP ($ in billions)` by `Population (in millions)`.
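
One way to keep the units straight: dividing a GDP column measured in billions of dollars by a population column measured in millions of people gives thousands of dollars per person. The sketch below illustrates this on a made-up two-row DataFrame; the region names and numbers are hypothetical and are not the lab's data.

```python
import pandas as pd

# Hypothetical toy data (not the lab's dataset).
toy_df = pd.DataFrame({
    "Region": ["A", "B"],
    "Population (in millions)": [10.0, 4.0],
    "GDP ($ in billions)": [500.0, 100.0],
})

# billions of $ / millions of people = thousands of $ per person
toy_df["GDP per capita ($ in thousands)"] = (
    toy_df["GDP ($ in billions)"] / toy_df["Population (in millions)"]
)

# Average of the per-region values: (50 + 25) / 2
mean_gdp_per_capita = toy_df["GDP per capita ($ in thousands)"].mean()
print(mean_gdp_per_capita)  # 37.5
```

Note that this averages the per-region ratios, which is not the same as dividing total GDP by total population.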

---

<center>

#### **Wait for Your Instructor to Continue**

---

<a name="ip2"></a>

---
## **Part 2: Data Visualization Review**
---

**Run the cell below to load the data.**

In [None]:
url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vS9jPkeKJ8QUuAl-fFdg3nJPDP6vx1byvIBl4yW8UZZJ9QEscyALJp1eywKeAg7aAffwdKP63D9osF1/pub?gid=169291584&single=true&output=csv"
movie_df = pd.read_csv(url)

movie_df.drop_duplicates(inplace=True)

mean_runtime = movie_df['Runtime'].mean()
movie_df['Runtime'] = movie_df['Runtime'].fillna(mean_runtime)

movie_df = movie_df.rename(columns = {"Runtime": "Runtime (min)"})
movie_df = movie_df.astype({"Runtime (min)": "int64"})

movie_df.head()

<a name="ip2.1"></a>

---
### **Part 2.1: Scatter Plots**
---

#### **Problem #2.1.1**

**Together**, let's create a scatterplot using `Runtime (min)` as the x-axis value and `Gross` as the y-axis value.

Make sure to include a meaningful:
* `Title`: "Gross vs. Runtime"
* `X-axis`: "Runtime (min)"
* `Y-axis`: "Gross (USD)"

---

#### **Now it's your turn! Try Problem #2.1.2 on your own.**

---

#### **Problem #2.1.2**

**Independently**, create a scatterplot using `Released_Year` as the x-axis value and `Runtime (min)` as the y-axis value.

Make sure to include a meaningful:
* `Title`: "Runtime vs. Released_Year"
* `X-axis`: "Year"
* `Y-axis`: "Runtime (min)"

---

<center>

#### **Wait for Your Instructor to Continue**

---

<a name="ip2.2"></a>

---
### **Part 2.2: Line Plots**
---

#### **Problem #2.2.1**

**Together**, let's create a line plot using `Runtime (min)` as the x-axis value and `Gross` as the y-axis value.

Make sure to include a meaningful:
* Title, ex: `'Gross Money vs. Runtime'`.
* X-axis label including units `'min'`.
* Y-axis label including units `'USD'`.

<br>

**NOTE**: This is not going to be a particularly helpful graph (the scatter plot is a better choice), but we often will not know this ahead of time. A lot of EDA and visualization involves trying a number of things and seeing what is useful.
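
One reason a line plot of raw rows tends to look messy is that matplotlib connects points in the order they appear in the data, not in order of their x values. The sketch below shows this on a tiny made-up DataFrame; the column names `x` and `y` are hypothetical, and `plt.subplots` is used here only to place the two versions side by side.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data, deliberately out of order on the x-axis.
toy_df = pd.DataFrame({"x": [3, 1, 4, 2], "y": [30, 10, 40, 20]})

fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(10, 4))

# Points are connected in row order, so unsorted x values zigzag.
ax_left.plot(toy_df["x"], toy_df["y"])
ax_left.set_title("Unsorted rows")

# Sorting by x first gives a line that moves left to right.
sorted_df = toy_df.sort_values("x")
ax_right.plot(sorted_df["x"], sorted_df["y"])
ax_right.set_title("Sorted by x")

plt.show()
```

Sorting helps only when each x value has one meaningful y value; with many movies sharing a runtime, a scatter plot is still likely the clearer choice.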

---

#### **Now it's your turn! Try Problem #2.2.2 on your own.**

---

#### **Problem #2.2.2**

**Independently**, create a line plot using `Released_Year` as the x-axis value and `Average Gross in Year` as the y-axis value.

Make sure to include a meaningful:
* Title, ex: `'Average Gross Money vs. Released Year'`.
* X-axis label.
* Y-axis label including units `'USD'`.

In [None]:
mean_gross = movie_df.groupby(# COMPLETE THIS LINE


---

<center>

#### **Wait for Your Instructor to Continue**

---

<a name="ip2.3"></a>

---
### **Part 2.3: Bar Plots**
---

#### **Problem #2.3.1**

**Together**, let's create a bar plot of the number of movies released per year.

Use the Series provided, `movies_per_year`, and make sure to include a meaningful:
* Title.
* X-axis label.
* Y-axis label.

In [None]:
movies_per_year = movie_df['Released_Year'].value_counts()

plt.bar(movies_per_year.index, # COMPLETE THIS CODE

---

#### **Now it's your turn! Try Problem #2.3.2 on your own.**

---

#### **Problem #2.3.2**

**Independently**, create a bar plot of the number of Dramas released per year.

Build a Dramas-only version of `movies_per_year` and make sure to include a meaningful:
* Title.
* X-axis label.
* Y-axis label.

<br>

**Hint**: Recall that you can use `.loc[CRITERIA, :]` to find all data matching given criteria, and see the example in Problem #2.3.1 for finding the number of movies released per year.
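
As a rough illustration of the filter-then-count pattern, here is a sketch on a made-up DataFrame. The column names `Year` and `Category` are hypothetical stand-ins, not columns of the movie data.

```python
import pandas as pd

# Hypothetical toy data; the column names are made up for illustration.
toy_df = pd.DataFrame({
    "Year": [2000, 2000, 2001, 2001, 2001],
    "Category": ["A", "B", "A", "A", "B"],
})

# Keep only the rows where the criterion is True, and all columns.
only_a = toy_df.loc[toy_df["Category"] == "A", :]

# Count how many of the remaining rows fall in each year.
a_per_year = only_a["Year"].value_counts()
print(a_per_year)
```

The index of `a_per_year` holds the years and its values hold the counts, which is the shape `plt.bar` expects for its x and height arguments.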

In [None]:
# COMPLETE THIS CODE

---

<center>

#### **Wait for Your Instructor to Continue**

---

<a name="ip3"></a>

---
## **Part 3: [OPTIONAL] Improving Visualizations**
---

In this section, we will explore several ways to improve upon the visuals we learned to make above.

<a name="ip3.1"></a>

---
### **Part 3.1: Improving Scatter Plots**
---

#### **Problem #3.1.1**

We are given average temperature values for the months of the year for two cities: `city_A` and `city_B`.

**Independently**, plot each city's average temperatures. We'll need to make two scatter plots.

Make `city_A` markers blue and `city_B` markers red. Add labels and a legend.

From the graph, which city is most likely located in the Northeast?

In [None]:
city_A_temps = [60,65,67, 70, 77, 84, 94, 101, 90, 82, 62]
city_B_temps = [-11, 14, 25, 32, 55, 73, 87, 92, 82, 66, 53]
months = np.arange(1,12)

# COMPLETE THE REST OF THE CODE

#### **Problem #3.1.2**

**Independently**, adjust the plot so that `city_A` markers are black and `city_B` markers are green.

In [None]:
city_A_temps = [60,65,67, 70, 77, 84, 94, 101, 90, 82, 62]
city_B_temps = [-11, 14, 25, 32, 55, 73, 87, 92, 82, 66, 53]
months = np.arange(1,12)

# COMPLETE THE REST OF THE CODE

---

<center>

#### **Wait for Your Instructor to Continue**

---

<a name="ip3.2"></a>

---
### **Part 3.2: Improving Line Plots**
---

#### **Problem #3.2.1**

**Independently**, create a line plot with the following features:


* A dashed line
* A grid




In [None]:
Year = [1920,1930,1940,1950,1960,1970,1980,1990,2000,2010]
Unemployment_Rate = [9.8,12,8,7.2,6.9,7,6.5,6.2,5.5,6.3]

# COMPLETE THE REST OF THE CODE

#### **Problem #3.2.2**

**Independently**, using the following data, create a line plot. In addition:
* Make that line dashed and dotted with `"-."`
* Add a grid to the background

In [None]:
# x axis values
x = [1,2,3]
# corresponding y axis values
y = [2,4,1]

# COMPLETE THE REST OF THE CODE

---

<center>

#### **Wait for Your Instructor to Continue**

---

<a name="ip3.3"></a>

---
### **Part 3.3: Improving Bar Plots**
---

#### **Problem #3.3.1**

**Independently**, make a bar plot with each bar as a different color.

In [None]:
langs = ['English', 'French', 'Spanish', 'Chinese', 'Arabic']
students = [23,17,35,29,12]

# COMPLETE THE REST OF THE CODE

#### **Problem #3.3.2**

**Independently**, use the following data to create a simple bar plot. Make all of the bars blue except bar `E`; make bar `E` red.

In [None]:
height = [3, 12, 5, 18, 45]       # y
bars = ['A', 'B', 'C', 'D', 'E']  # x

# COMPLETE THE REST OF THE CODE

---

<center>

#### **Wait for Your Instructor to Continue**

---

<a name="ip3.4"></a>

---
### **Part 3.4: Enhancing Plot Aesthetics**
---

#### **Run the cell below to import the data for the following problems.**

This dataset contains information on U.S. agricultural exports in 2011.

In [None]:
url = 'https://raw.githubusercontent.com/plotly/datasets/master/2011_us_ag_exports.csv'
export_df = pd.read_csv(url)
export_df.head()

#### **Problem #3.4.1**

**Together**, let's compare beef export of different states using a bar plot. Adjust the size of the plot so that the graph and its labels are legible.



**NOTE**: To use a DataFrame for a graph, here is the syntax:
```
plt.bar(DF_NAME['x_variable'], DF_NAME['y_variable'])
```
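
A minimal sketch of this syntax combined with a figure-size adjustment, using a made-up DataFrame (the names `toy_df`, `item`, and `amount` are hypothetical and unrelated to the exports data):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data (not the exports dataset).
toy_df = pd.DataFrame({
    "item": ["alpha", "beta", "gamma", "delta"],
    "amount": [12.5, 7.0, 21.3, 3.8],
})

# Create the figure first so that the size applies to the bars drawn next.
fig = plt.figure(figsize=(10, 4))  # width, height in inches
plt.bar(toy_df["item"], toy_df["amount"])

plt.xticks(rotation=90)  # rotate tick labels so long names do not overlap
plt.title("Amount by Item")
plt.xlabel("Item")
plt.ylabel("Amount")
plt.show()
```

A wider figure and rotated tick labels are two common ways to keep many category names legible; which values work best depends on how many bars there are.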

---

#### **Now it's your turn! Try Problem #3.4.2 on your own.**

---

#### **Problem #3.4.2**

**Independently**, compare the export of corn from different states using a bar plot. Make sure you adjust the size of the plot.

---
# **Lab 1: Part II - Review of Linear Regression with sklearn**
---
### **Description**
This lab provides a comprehensive overview of implementing and evaluating Linear Regression with sklearn.

<br>

### **Lab Structure**
**Part 1**: [Predicting Wine Quality](#iip1)

**Part 2**: [Predicting CO2 Emissions](#iip2)



<br>

### **Learning Objectives**
 By the end of this lab, we will:
* Understand how to implement a linear regression model with sklearn.

* Understand how to evaluate a linear regression model using standard regression metrics.


<br>


### **Resources**
* [EDA with pandas Cheat Sheet](https://docs.google.com/document/d/1FFoqw45P-kuoq912ARP4qfdGeLTqoq73_qjZThPp2_8/edit?usp=drive_link)

* [Data Visualization with matplotlib Cheat Sheet](https://docs.google.com/document/d/1YlUp6ll81qOyDpU1OWzE-SPxQ3hnF5C9ukLRL_6PYKE/edit?usp=drive_link)

* [Linear Regression with sklearn Cheat Sheet](https://docs.google.com/document/d/1iVieBynTpoKq1LA0kR-4pqDo6evoW5wvbNyE0wOGhYY/edit?usp=drive_link)


<br>

**Before starting, run the code below to import all necessary functions and libraries.**


In [None]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import matplotlib.pyplot as plt

from sklearn import model_selection
from sklearn import datasets
from sklearn.metrics import *

<a name="iip1"></a>

---
## **Part 1: Predicting Wine Quality**
---

In this part, we will implement a linear regression model aimed at predicting the quality rating of wines based on their chemical properties and characteristics.

<br>

This dataset contains data related to wine properties, including chemical characteristics like acidity, pH, and alcohol content. The target variable (label) represents a quality rating for each wine, which is a quantitative measure of wine quality.




#### **Step #1: Load in Data**

**Run the code below to load the data.**

In [None]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"
data = pd.read_csv(url, sep=';')

#### **Step #2: Choose your Variables**



In [None]:
inputs = data.drop("quality", axis = 1)
output = data[# COMPLETE THIS CODE

#### **Step #3: Split your Data**


In [None]:
X_train, X_test, y_train, y_test = # COMPLETE THIS CODE

#### **Step #4: Import an ML Algorithm**




In [None]:
# COMPLETE THIS CODE

#### **Step #5: Initialize the Model**


In [None]:
model = # COMPLETE THIS CODE

#### **Step #6: Fit, Test, and Visualize**


In [None]:
model.fit(X_train, # COMPLETE THIS CODE

In [None]:
predictions = # COMPLETE THIS CODE

In [None]:
plt.figure(figsize=(8, 8))
plt.scatter(y_test, predictions)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color = 'black', label="Correct prediction")


plt.xlabel('True Quality', fontsize = 'x-large')
plt.ylabel('Predicted Quality', fontsize = 'x-large')
plt.title("Real vs. Predicted Wine Quality", fontsize = 'x-large')
plt.legend()

plt.show()

#### **Step #7: Evaluate**

Let's evaluate this model and put it to the test! Specifically, evaluate the model using our standard regression metrics: $R^2$, MSE, and MAE.
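
As a rough illustration of what these three metrics measure, the sketch below computes them on a handful of made-up numbers. The values in `y_true` and `y_pred` are hypothetical and unrelated to the wine data.

```python
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Hypothetical true and predicted values, for illustration only.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 10.0]

r2 = r2_score(y_true, y_pred)              # 1 - (sum of squared errors / total variation)
mse = mean_squared_error(y_true, y_pred)   # average of the squared errors
mae = mean_absolute_error(y_true, y_pred)  # average of the absolute errors

print("R^2:", r2)
print("MSE:", mse)
print("MAE:", mae)
```

For these numbers the errors are 0.5, 0, -0.5, and -1, so MSE is 1.5 / 4 = 0.375, MAE is 2 / 4 = 0.5, and $R^2$ is 1 - 1.5 / 20 = 0.925. An $R^2$ closer to 1 and MSE and MAE closer to 0 indicate a better fit.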


In [None]:
# COMPLETE THIS CODE

#### **Step #8: Apply your Model**

You are provided with data from two new wines, and you want to assess the predicted quality ratings for each of them. The goal is to determine which wine is likely to have the higher quality rating based on the model's predictions.

Here is the data for the two wines:

**Wine 1:**
* Fixed Acidity = 12.5
* Volatile Acidity = 0.3
* Citric Acid = 0.6
* Residual Sugar = 1.2
* Chlorides = 0.07
* Free Sulfur Dioxide = 15.0
* Total Sulfur Dioxide = 50.0
* Density = 0.998
* pH = 3.2
* Sulphates = 0.68
* Alcohol Content = 11.5

<br>

**Wine 2:**

* Fixed Acidity = 13.2
* Volatile Acidity = 0.28
* Citric Acid = 0.45
* Residual Sugar = 2.0
* Chlorides = 0.09
* Free Sulfur Dioxide = 12.0
* Total Sulfur Dioxide = 65.0
* Density = 0.995
* pH = 3.3
* Sulphates = 0.55
* Alcohol Content = 12.0

You will use your linear regression model to predict the quality ratings for these wines and assess their relative quality based on the predictions.

##### **1. Predict the quality of Wine 1**


In [None]:
# COMPLETE THIS CODE

##### **2. Predict the quality of Wine 2**

In [None]:
# COMPLETE THIS CODE

<a name="iip2"></a>

---
## **Part 2: Predicting CO2 Emissions**
---

Using the CO2 Emissions dataset, do the following:
* Build a model that will predict the CO2 emissions of a car;
* Predict the CO2 emissions of a car with a specific volume and weight.

<br>

Since 1970, CO2 emissions have increased by nearly 90%. These elevated CO2 levels cause poor air quality and contribute to climate change. Globally, cars and other transportation vehicles are responsible for about 29% of overall CO2 emissions. This CO2 emissions dataset is a collection of data from cars that contains information on the car's make, model, volume, weight, and how much CO2 it emits.

The features are as follows:
* `Car`: name of car brand
* `Model`: name of car model
* `Volume`: engine size (in cm^3)
* `Weight`: weight of car (in kg)
* `CO2`: amount of CO2 emitted (in g/km)

#### **Step #1: Load the data**

In [None]:
url = "https://raw.githubusercontent.com/the-codingschool/TRAIN/main/emissions/car_emissions.csv"

cars_df = pd.read_csv(url)
cars_df.head()

#### **Step #2: Decide independent and dependent variables**

We are going to use `Volume` and `Weight` as our independent variables for predicting `CO2` emissions.



In [None]:
features = # COMPLETE THIS CODE
labels = # COMPLETE THIS CODE

#### **Step #3: Split data into training and testing data**


In [None]:
# COMPLETE THIS CODE

#### **Step #4: Import your algorithm**


In [None]:
# COMPLETE THIS CODE

#### **Step #5: Initialize your model and set hyperparameters**



In [None]:
# COMPLETE THIS CODE

#### **Step #6: Fit, Test, and Visualize**


In [None]:
model.fit(X_train, # COMPLETE THIS CODE

In [None]:
predictions = # COMPLETE THIS CODE

In [None]:
# VISUALIZE THE TRUE VS. PREDICTED VALUES

#### **Step #7: Evaluate**

Let's evaluate this model and put it to the test! Specifically, evaluate the model using our standard regression metrics: $R^2$, MSE, and MAE.


#### **Step #8: Use the model**

Using the model we created, predict the CO2 emissions of two cars:

* **Car 1:** Volume is 800 cm^3 and weight is 1020 kg

* **Car 2:**  Volume is 1020 cm^3 and weight is 800 kg

<br>

**NOTE**: You must create a DataFrame containing the information for the new cars:

```python
new_car_data = pd.DataFrame(new_car_data_here, columns = ["Volume", "Weight"])
```
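
A minimal sketch of the same pattern on made-up data, assuming a model trained on two hypothetical feature columns named `a` and `b` (none of these names or numbers come from the car dataset):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical training data with made-up feature names "a" and "b".
train_X = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 1, 4, 3]})
train_y = [5, 4, 11, 10]  # generated as a + 2*b

toy_model = LinearRegression()
toy_model.fit(train_X, train_y)

# One inner list per new observation, in the same column order used for training.
new_rows = pd.DataFrame([[5, 5], [6, 0]], columns=["a", "b"])
toy_predictions = toy_model.predict(new_rows)
print(toy_predictions)
```

Using the same column names and order as the training data lets the model match each value to the right feature.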

In [None]:
# COMPLETE THIS CODE

---
# **[OPTIONAL] Additional Practice Problems**
---

### **Description**
This optional homework will provide additional practice with data exploration and visualization, as well as linear regression modeling with sklearn.

<br>

### **Structure**
**Part 1**: [Exploratory Data Analysis](#hwp1)

**Part 2**: [Data Visualizations](#hwp2)

>  **Part 2.1**: [Scatter Plots](#hwp2.1)
>
>  **Part 2.2**: [Line Plots](#hwp2.2)
>
>  **Part 2.3**: [Bar Plots](#hwp2.3)
>
>  **Part 2.4**: [[OPTIONAL] Improving Visualizations](#hwp2.4)

**Part 3**: [Linear Regression](#hwp3)

<br>


### **Resources**
* [EDA with pandas Cheat Sheet](https://docs.google.com/document/d/1FFoqw45P-kuoq912ARP4qfdGeLTqoq73_qjZThPp2_8/edit?usp=drive_link)

* [Data Visualization with matplotlib Cheat Sheet](https://docs.google.com/document/d/1YlUp6ll81qOyDpU1OWzE-SPxQ3hnF5C9ukLRL_6PYKE/edit?usp=drive_link)

* [Linear Regression with sklearn Cheat Sheet](https://docs.google.com/document/d/1iVieBynTpoKq1LA0kR-4pqDo6evoW5wvbNyE0wOGhYY/edit?usp=drive_link)

<br>

**Before starting, run the code below to import all necessary functions and libraries.**


In [None]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import model_selection
from sklearn import datasets
from sklearn.metrics import *

<a name="hwp1"></a>

---
## **Part 1: Exploratory Data Analysis**
---

#### **Problem #1.1**
This dataset contains historical statistics for NBA (National Basketball Association) players, sourced from Basketball-Reference.com. The data includes a wide range of metrics from basic statistics like games played and minutes played to more advanced statistics like player efficiency ratings. While the dataset is rich and detailed, we are only focusing on a subset of the available columns to introduce you to the basics of data exploration and manipulation.

<br>

**Even if you're not familiar with basketball, understanding the data columns should still be relatively straightforward. Here's what each column we're using means:**

- `player_id`: A unique ID assigned by Basketball-Reference.com to each player.

- `name_common`: The name of the basketball player.

- `year_id`: This refers to the NBA season year. For example, the 2019-2020 NBA season would be represented as "2020".

- `age`: The age of the player as of February 1 of that season.

- `team_id`: The abbreviation for the team that the player played for during that season. Each NBA team has a unique abbreviation, like 'LAL' for the Los Angeles Lakers.

- `G`: Games Played - The number of games the player participated in during that season.

- `Min`: Minutes Played - The total number of minutes the player was on the court during the season.

- `MPG`: Minutes Per Game - This is the average number of minutes the player was on the court per game during the season. It's calculated as Min divided by G.

- `FT%`: Free Throw Percentage - This is the percentage of free throws the player made successfully. A free throw is an opportunity given to a player to score one point, unopposed, from a position 15 feet from the basket. It's calculated as Free Throws Made divided by Free Throws Attempted.

<br>

**Run the code cell below to load the data.**

In [None]:
url = 'https://raw.githubusercontent.com/fivethirtyeight/nba-player-advanced-metrics/master/nba-data-historical.csv'
nba_df = pd.read_csv(url)
nba_df = nba_df[['player_id', 'name_common', 'year_id', 'age', 'team_id', 'G', 'Min', 'MPG', 'FT%']]
nba_df = nba_df.dropna()
nba_df

#### **Problem #1.2**

How many players are included in this dataset?

#### **Problem #1.3**

How many columns are in this DataFrame?

#### **Problem #1.4**
How many columns contain numerical data?

#### **Problem #1.5**

How many different NBA teams (`team_id`) are included in the dataset?

#### **Problem #1.6**

What is the most common `age` among all players in the dataset?


#### **Problem #1.7**

Complete the code below to output players above the age of 35.

In [None]:
older_players = nba_df[# COMPLETE THIS LINE OF CODE

older_players['player_id']


#### **Problem #1.8**

Extract the following columns: `player_id`, `age`, `FT%`

#### **Problem #1.9**

Identify players with a Free-Throw Percentage (`FT%`) greater than 90%.

#### **Problem #1.10**

What is the average age of the players in the dataset?

#### **Problem #1.11**

What is the median value for the Minutes Per Game (`MPG`) across all players?

#### **Problem #1.12**

Calculate the sum of minutes played (`Min`) for all players in the dataset.

<a name="hwp2"></a>

---
## **Part 2: Data Visualizations**
---

<a name="hwp2.1"></a>

---
### **Part 2.1: Scatter Plots**
---

#### **Problem #2.1.1**

Create a scatter plot given the array `x_range` and array `y_range`. Add the title "Random Variable Vs. Random Variable" to the graph, and add x- and y-labels that say "Random X" and "Random Y", respectively.

In [None]:
x_range = np.random.randint(400, size=50)
y_range = np.random.randint(400, size=50)

# add scatter plot

#### **Problem #2.1.2**

Given the following scatter plot, add the following labels and title:
* `Title`: "Distance vs Workout Duration"
* `X-axis`: "Distance (km)"
* `Y-axis`: "Workout Duration (min)"

In [None]:
workout_df = pd.DataFrame({"date": ["10/17/21", "11/04/21", "11/18/21", "11/23/21", "11/28/21", "11/29/21"],
           "distance_km": [4.3, 1.9, 1.9, 1.9, 2.3, 2.8],
           "duration_min": [21.58, 9.25, 9.0, 8.93, 11.94, 14.05],
           "delta_last_workout":[1, 18, 14, 5, 5, 1],
           "day_category": [0, 1, 1, 0, 0, 0]})

# creating scatter
x = workout_df['distance_km']
y = workout_df['duration_min']

#add code

plt.scatter(x, y)
plt.show()

#### **Problem #2.1.3**

Create a scatter plot for the following data. Make the title and labels the following:
* `Title`: "Age vs. height in teenagers"
* `X-axis`: "Age"
* `Y-axis`: "Height (in)"

In [None]:
age = [14, 14, 13, 18, 17, 20, 17, 16, 19, 19]
height = [65, 68, 58, 61, 64, 75, 67, 69, 71, 63] #in inches
# Scatter Plot

<a name="hwp2.2"></a>

---
### **Part 2.2: Line Plots**
---

#### **Problem #2.2.1**

Create a line plot for the following data. Add a title called "Bike Rideshare Activity" with x-axis and y-axis labels called "Month" and "Bike Trips", respectively.

In [None]:
months = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
trips = [300, 358, 521, 574, 783, 1549, 1776, 1920, 1714, 1234, 703, 438]

# Line Plot

#### **Problem #2.2.2**

Create a line plot for the following data. Add a title called "Unemployment rate over the years" with x-axis and y-axis labels called "Year" and "Unemployment", respectively.

In [None]:
Year = [1920,1930,1940,1950,1960,1970,1980,1990,2000,2010]
Unemployment_Rate = [9.8,12,8,7.2,6.9,7,6.5,6.2,5.5,6.3]

In [None]:
# Line Plot


#### **Problem #2.2.3**

The information below contains data on how many people have visited Disney Parks globally. Create a line plot to show how the number of visits changed between 2017 and 2020. Make sure you add helpful labels and a title.

**Note:** Visitors is in millions. For example, in 2017, 150 million people visited Disney Parks globally.

In [None]:
year = [2017, 2018, 2019, 2020]
visitors = [150, 157, 155.991, 43.525]

In [None]:
# Line Plot


<a name="hwp2.3"></a>

---
### **Part 2.3: Bar Plots**
---

#### **Problem #2.3.1**

Now, plot the same data as above, but using a bar plot. In some cases, it may not be clear which plot will be best until you see your options!

In [None]:
year = ['2017', '2018', '2019', '2020']
visitors = [150, 157, 155.991, 43.525]

In [None]:
# Bar Plot


#### **Problem #2.3.2**

Create a bar plot for the following data. Add a title called "Favorite Types of Drinks" with x-axis and y-axis labels called "Drink" and "Number of People", respectively.

In [None]:
drinks = ["water", "tea", "coffee", "juice", "soda"]
people = [12, 5, 17, 15, 9]

# Bar Plot

#### **Problem #2.3.3**

Create a bar plot using the DataFrame below. Add labels: `Innovative companies` for the title, `Countries` for the x-axis label, and `Number of Companies` for the y-axis label.

In [None]:
companies_df = pd.DataFrame({"countries": ["USA", "South Korea", "China", "Japan", "Germany", "Netherlands", "India", "France", "London", "Switzerland", "Sweden", "Italy"],
             "companies": [25, 2, 3, 3, 6, 1, 3, 1, 1, 2, 2, 1]})

companies_df.head()

# add code

<a name="hwp2.4"></a>

---
### **Part 2.4: Improving Visualizations [OPTIONAL]**
---

#### **Problem #2.4.1**

Using the following data, create a line plot.

**Hint:** Your graph should have three distinct lines corresponding to `y1`, `y2`, and `y3`.

In addition:
* Make sure each line is a different color
* Make `y1` a dashed line
* Add a grid to the background and make the background color black
* Add a legend

In [None]:
# x axis values
x = [1,2,3,4,5]
# corresponding y axis values
y1 = [2,4,6,8,10] # y = 2x
y2 = [0.5,1,1.5,2,2.5] # y = 0.5x
y3 = [1,4,9,16,25] # y = x^2

# COMPLETE THE REST OF THE CODE

#### **Problem #2.4.2**

Using the following data, create a scatter plot. In addition:
* Make the data points green
* Change the transparency to 0.6
* Make the x-label `Temperature (Fahrenheit)` and in the font `fantasy`
* Make the y-label `Number of People` and in the font `fantasy`
* Make the title `Number of People at the Beach` and in the font `fantasy`

In [None]:
# x axis values
x = [87, 94, 98, 102, 96, 90, 92, 93, 85, 82, 96, 80, 90, 91]
# corresponding y axis values
y = [204, 375, 522, 731, 439, 302, 317, 346, 268, 197, 649, 158, 327, 353]

# COMPLETE THE REST OF THE CODE

#### **Problem #2.4.3**

Using the following data, create a bar plot. In addition:
* Make each bar's color the same as the color name
* Make the x-label `Favorite Color` and with a font size of 12
* Make the y-label `Number of People` and with a font size of 12
* Make the title `Number of People vs Favorite Color` and with a font size of 20
* Make the bar width 0.6

In [None]:
# x axis values
x = [1,2,3,4,5]
# corresponding y axis values
y1 = [2,4,6,8,10]
y2 = [0.5,1,1.5,2,2.5]
y3 = [1,4,9,16,25]

# COMPLETE THE REST OF THE CODE

#### **Problem #2.4.4**

Using the following data, create a line plot. In addition:
* Make line y1 brown and dashed, and make line y2 pink
* Add a grid to the background
* Add a legend
* Make the title `X vs Y` in font `monospace` and in size 18


In [None]:
# x axis values
x = [1,2,3,4,5,6,7]
# corresponding y axis values
y1 = [9, 4, 6, 8, 22, 17, 13]
y2 = [3, 5, 8, 12, 17, 23, 30]

# COMPLETE THE REST OF THE CODE

#### **Problem #2.4.5**

Create a bar plot for the following data with the following:
* One bar showing the number of females in the dataset and another bar showing the number of males in the dataset.
* Bars labeled 'Female' and 'Male'.
* The y-axis labeled 'Number in Dataset' with extra large font.
* A title called "Number of Males and Females in the Dataset" with extra large font.

<br>

**Hint**: You will need to use pandas functions to get the count of males and females in the data frame.

In [None]:
url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vSa0metcKBFqn-MHLn05vVGWONMlzljcWa-xIM1wJPXIa5kbrmIzGqmWcMh8eKG_ntByF9qqn6Mx3MT/pub?gid=1052859518&single=true&output=csv'
df = pd.read_csv(url)
df.head()

# COMPLETE THE REST OF THE CODE

#### **Problem #2.4.6**

Create a *grouped* bar plot for the following data with the following:
* One bar graph showing the number of females with heart attacks and without in the dataset.
* This bar graph should be labeled 'Female' for the legend.
* Another bar graph showing the number of males with heart attacks and without in the dataset.
* This bar graph should be labeled 'Male' for the legend.
* Both bar graphs should be located on the x-axis and given a width to make the graph readable.
* The y-axis labeled 'Number in Dataset' with extra large font.
* A title called "Breakdown of Heart Attacks by Sex" with extra large font.

<br>

**Hint**: You will need to use pandas functions and comparisons to get the count of males and females with and without heart attacks in the data frame.
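
As a rough sketch of how a grouped bar plot is laid out in matplotlib, the example below places two sets of bars side by side by shifting each set half a bar width away from the category position. The counts and labels are hypothetical and do not come from the dataset.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical counts for two groups across two categories.
group_1_counts = [30, 12]
group_2_counts = [25, 20]

positions = np.arange(2)  # one slot per category: 0 and 1
width = 0.4

# Shift each group half a bar width left or right of the category slot.
left_positions = positions - width / 2
right_positions = positions + width / 2

plt.bar(left_positions, group_1_counts, width=width, label="Group 1")
plt.bar(right_positions, group_2_counts, width=width, label="Group 2")

plt.xticks(ticks=positions, labels=["Category 0", "Category 1"])
plt.ylabel("Count")
plt.legend()
plt.show()
```

With a width of 0.4, the two bars in each category touch without overlapping, and the tick label sits between them.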

In [None]:
df_female = df[# COMPLETE THIS LINE
df_male = df[# COMPLETE THIS LINE

# COMPLETE THE REST OF THIS CODE

plt.xticks(ticks = [0, 1], labels = ['No Heart Attack', 'Heart Attack'], fontsize = 'x-large')

plt.# COMPLETE THIS LINE

#### **Comment on this Dataset**

This is an unfortunately common case of biased data, specifically *unbalanced data*, leading to potentially harmful results. We could attempt removing `'Sex'` as a feature to blind any ML models to the sex of the patient. However, bias often runs deeper than the most superficial variables and may be correlated with others in ways that humans and especially advanced ML algorithms can still pick up on. Consider some of the following ideas for improving on these results:

* Using statistical methods for balancing the data. For instance, upsampling and downsampling are common first approaches to tackling this problem.

* Find a dataset that is more balanced to begin with. In an ideal world, we would make sure that the data is balanced (representative) upon collection.
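
To make the first idea concrete, here is a minimal sketch of upsampling with pandas, assuming a made-up DataFrame with a hypothetical `group` column. It only illustrates the mechanics and is not a recommendation for this particular dataset.

```python
import pandas as pd

# Hypothetical unbalanced data: 6 rows of group "M", 2 rows of group "F".
toy_df = pd.DataFrame({
    "group": ["M"] * 6 + ["F"] * 2,
    "value": range(8),
})

majority = toy_df[toy_df["group"] == "M"]
minority = toy_df[toy_df["group"] == "F"]

# Upsample: draw minority rows with replacement until the groups match in size.
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=42)

balanced_df = pd.concat([majority, minority_upsampled], ignore_index=True)
print(balanced_df["group"].value_counts())
```

Downsampling is the mirror image: sample the majority group without replacement down to the size of the minority group. Upsampling repeats existing rows and adds no new information, so it is usually applied to the training split only.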

<a name="hwp3"></a>

---
## **Part 3: Linear Regression**
---

Using the Bike Sharing dataset, do the following:
* Build a model that can predict the total number of bike rentals.
* Use a different subset of features to build another model to predict the total number of bike rentals and compare the results.

<br>

The Bike Sharing dataset contains information about hourly bike rental data spanning two years, provided by a bike-sharing system in Washington, D.C. The dataset includes factors such as weather, date, time, and user information.

The features are as follows:
* `instant`: A unique identifier for each record in the dataset.
* `dteday`: The date of the bike rental in the format yyyy-mm-dd.
* `season`: The season of the year (1: spring, 2: summer, 3: fall, 4: winter).
* `yr`: The year (0: 2011, 1: 2012).
* `mnth`: The month of the year (1 to 12).
* `hr`: The hour of the day (0 to 23).
* `holiday`: A binary indicator of whether it is a holiday or not (0: not a holiday, 1: holiday).
* `weekday`: The day of the week (0: Sunday, 1: Monday, ..., 6: Saturday).
* `workingday`: A binary indicator of whether it is a working day or not (0: weekend or holiday, 1: working day).
* `weathersit`: The weather situation (1: clear, 2: misty/foggy, 3: light rain/snow, 4: heavy rain/snow).
* `temp`: The normalized temperature in Celsius.
* `atemp`: The normalized "feels like" temperature in Celsius.
* `hum`: The normalized humidity level.
* `windspeed`: The normalized wind speed.
* `casual`: The count of casual bike rentals.
* `registered`: The count of registered bike rentals.
* `cnt`: The total count of bike rentals (casual + registered).

#### **Step #1: Load in Data**

**Run the code below to load the data.**

In [None]:
# Import required libraries
import urllib.request
import zipfile

# Download the zip file and extract the CSV file(s)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00275/Bike-Sharing-Dataset.zip"
filename = "Bike-Sharing-Dataset.zip"
urllib.request.urlretrieve(url, filename)
with zipfile.ZipFile(filename, "r") as zip_ref:
    zip_ref.extractall()

# Read the CSV file(s) into Pandas dataframes
hour_df = pd.read_csv("hour.csv")
day_df = pd.read_csv("day.csv")

# Combine the two dataframes into a single dataframe
bikes_df = pd.concat([hour_df, day_df], ignore_index=True).drop(columns = ['instant', 'dteday', 'casual', 'hr'])


bikes_df.head()

#### **Step #2: Choose your Variables**

We are using all available features to predict `cnt`.


In [None]:
inputs = # COMPLETE THIS CODE
output = # COMPLETE THIS CODE

#### **Step #3: Split your Data**


In [None]:
X_train, X_test, y_train, y_test = # COMPLETE THIS CODE

#### **Step #4: Import an ML Algorithm**




In [None]:
# COMPLETE THIS CODE

#### **Step #5: Initialize the Model**


In [None]:
model = # COMPLETE THIS CODE

#### **Step #6: Fit, Test, and Visualize**


In [None]:
model.fit(X_train, # COMPLETE THIS CODE

In [None]:
predictions = # COMPLETE THIS CODE

In [None]:
plt.figure(figsize=(8, 8))
plt.scatter(y_test, predictions)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color = 'black', label="Correct prediction")


plt.xlabel('True Count', fontsize = 'x-large')
plt.ylabel('Predicted Count', fontsize = 'x-large')
plt.title("Real vs. Predicted Count", fontsize = 'x-large')
plt.legend()

plt.show()

#### **Step #7: Evaluate**

Let's evaluate this model and put it to the test! Specifically, evaluate the model using our standard regression metrics: $R^2$, MSE, and MAE.


In [None]:
# COMPLETE THIS CODE

---

# End of Notebook

© 2024 The Coding School, All rights reserved