<a href="https://colab.research.google.com/github/zmy2338/Machine-Learning-AWS/blob/main/Copy_of_%5BSTUDENT_VERSION%5D_AWS_Part_II_Day_1_Lab_Notebook_v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Lab 1: Review of the Basics I**
---
### **Description**
This lab reviews exploratory data analysis (EDA) with Python's pandas library and data visualization with the matplotlib library. Throughout the notebook, you'll review how to load and manipulate datasets with pandas commands, use matplotlib to create visualizations that reveal patterns and trends in the data, and implement and evaluate basic machine learning models with sklearn.



### **Lab Structure**
* **Part 1**: [Exploratory Data Analysis Review](#p1)

  >  **Part 1.1**: [Basic Commands](#p1.1)

  >  **Part 1.2**: [Further Exploration](#p1.2)

* **Part 2**: [Data Visualization Review](#p2)

  >  **Part 2.1**: [Scatter Plots](#p2.1)

  >  **Part 2.2**: [Line Plots](#p2.2)

  >  **Part 2.3**: [Bar Plots](#p2.3)

* **Part 3**: [[OPTIONAL] Improving Visualizations](#p3)

  >  **Part 3.1**: [Improving Scatter Plots](#p3.1)

  >  **Part 3.2**: [Improving Line Plots](#p3.2)

  >  **Part 3.3**: [Improving Bar Plots](#p3.3)

  >  **Part 3.4**: [Enhancing Plot Aesthetics](#p3.4)

* **Part 4**: [[ADDITIONAL PRACTICE] Improving Visualizations](#p4)

* **Part 5**: [Modeling with sklearn](#p5)



<br>

### **Learning Objectives**
 By the end of this lab, we will:
* Understand basic pandas commands for EDA.

* Understand basic matplotlib commands for Data Visualization.

* Understand how to implement and evaluate Linear Regression, K-Nearest Neighbors, and Logistic Regression with sklearn.


<br>


### **Resources**
* [EDA with pandas](https://docs.google.com/document/d/19_Vzr_sxVOKxlvcDnCs15J-Hy6fidXDBtj16jSIhllA/edit?usp=sharing)
* [Data Visualization with matplotlib](https://docs.google.com/document/d/1K9sb2lpcLVkQkdmF-JlCJx0k0mIiTsLk6ugsnv54mew/edit?usp=sharing)

* [Linear Regression](https://docs.google.com/document/d/1Ul6ILmP-UZ9LT8iyUxM_x9uRJ_LKHDfwBPIWIQU-d3g/edit?usp=sharing)

* [K-Nearest Neighbors](https://docs.google.com/document/d/1Z0Uk43jYmRPHgzIyg0s1YIU6hNIw43PFFtqbl8lbj64/edit?usp=sharing)

* [Logistic Regression](https://docs.google.com/document/d/1dAlAlRI1YAKa2Od08TJ9Z7JUysxcUucZXvNMVMvLA4w/edit?usp=sharing)

<br>

**Before starting, run the code below to import all necessary functions and libraries.**


In [None]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn import datasets
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import classification_report

<a name="p1"></a>

---
## **Part 1: Exploratory Data Analysis Review**
---




<a name="p1.1"></a>

---
### **Part 1.1: Basic Commands**
---


**Run the code cell below to create the DataFrame.**

In [None]:
df = pd.DataFrame({'U.S. State': ['California', 'Florida', 'Indiana', 'Texas', 'Pennsylvania'],
        'Population (in millions)': [38, 21, 6.5, 28, 13],
        'Capital': ['Sacramento', 'Tallahassee', 'Indianapolis', 'Austin', 'Harrisburg'],
        'GDP ($ in billions)': [3700, 1070, 352, 1876, 726]})

#### **Problem #1.1.1**

**Together**, let's inspect what `.head()` tells us about this DataFrame.

##### **Solution**

#### **Problem #1.1.2**

**Together**, let's determine what datatype `Population (in millions)` is.

##### **Solution**

#### **Problem #1.1.3**

**Together**, let's print all of the unique values for `GDP ($ in billions)`.

##### **Solution**
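A minimal sketch of the three commands used together above. It rebuilds a trimmed copy of the lab's DataFrame (only the three columns needed here) so the cell runs on its own; the variable names `first_rows`, `population_dtype`, and `unique_gdp` are illustrative and not part of the lab.

```python
import pandas as pd

# Trimmed copy of the lab's DataFrame so this cell is self-contained.
df = pd.DataFrame({'U.S. State': ['California', 'Florida', 'Indiana', 'Texas', 'Pennsylvania'],
                   'Population (in millions)': [38, 21, 6.5, 28, 13],
                   'GDP ($ in billions)': [3700, 1070, 352, 1876, 726]})

first_rows = df.head()                                    # first five rows by default
population_dtype = df['Population (in millions)'].dtype   # the column's datatype
unique_gdp = df['GDP ($ in billions)'].unique()           # array of the distinct values

print(first_rows)
print(population_dtype)
print(unique_gdp)
```

Note that `.dtype` is an attribute, so it takes no parentheses, while `.head()` and `.unique()` are methods.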

---

#### **Now it's your turn! Try Problems #1.1.4 - 1.1.7 on your own.**

---

#### **Problem #1.1.4**

**Independently**, determine the column names in the dataset.

##### **Solution**

#### **Problem #1.1.5**

**Independently**, determine the highest `GDP ($ in billions)` in the dataset.

##### **Solution**

#### **Problem #1.1.6**

**Independently**, determine which states are included in this dataset.

##### **Solution**

#### **Problem #1.1.7**

**Independently**, determine the range of GDP values among the states.

##### **Solution**

---

<center>

#### **Wait for Your Instructor to Continue**

---

<a name="p1.2"></a>

---
### **Part 1.2: Further Exploration**
---



#### **Problem #1.2.1**

**Independently**, determine the average `Population (in millions)` size among the U.S. states in the dataset.

##### **Solution**

#### **Problem #1.2.2**

**Independently,** explore rows 4 and 5. What are the U.S. States listed?

##### **Solution**

#### **Problem #1.2.3**

**Independently**, determine the total `Population (in millions)` across all states.

##### **Solution**

#### **Problem #1.2.4**

**Independently**, determine the average `Population (in millions)` of the states.

##### **Solution**

#### **Problem #1.2.5**

**Independently**, determine the `Population (in millions)` for the 3rd state in the dataset.

##### **Solution**

#### **Problem #1.2.6**

**Independently**, determine how many states have a population greater than 20 million.

##### **Solution**

#### **Problem #1.2.7**

**Independently**, explore the last row in the dataset.

##### **Solution**
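The selection and aggregation patterns practiced in Part 1.2 could be sketched as follows. This is one possible approach, not the only one. It assumes that "rows 4 and 5" means the fourth and fifth rows, and it rebuilds a trimmed copy of the lab's DataFrame so the cell runs on its own. All variable names besides `df` are illustrative.

```python
import pandas as pd

# Trimmed copy of the lab's DataFrame so this cell is self-contained.
df = pd.DataFrame({'U.S. State': ['California', 'Florida', 'Indiana', 'Texas', 'Pennsylvania'],
                   'Population (in millions)': [38, 21, 6.5, 28, 13],
                   'GDP ($ in billions)': [3700, 1070, 352, 1876, 726]})

pop = df['Population (in millions)']

average_population = pop.mean()       # arithmetic mean of the column
total_population = pop.sum()          # sum of the column
third_state_population = df.loc[2, 'Population (in millions)']  # row labels start at 0, so the 3rd row has label 2
rows_4_and_5 = df.iloc[3:5]           # positions 3 and 4, i.e. the 4th and 5th rows
last_row = df.iloc[-1]                # negative positions count from the end
num_large_states = (pop > 20).sum()   # each True counts as 1 when summed

print(average_population, total_population, third_state_population, num_large_states)
print(rows_4_and_5)
print(last_row)
```

`.loc` selects by label and `.iloc` selects by position; with the default index the two only differ in how slices treat their end point.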

### **[Challenge Question] Problem #1.2.8**

**Independently**, determine the average `GDP per capita` for the states.

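**HINT:** Divide `GDP ($ in billions)` by `Population (in millions)`. Because billions divided by millions leaves a factor of 1,000, the result is in thousands of dollars per person.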

##### **Solution**
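Here is a hedged sketch of the unit handling for this challenge. It converts both columns to plain dollars and plain people before dividing, then averages the per-state ratios. The names `gdp_usd`, `people`, `gdp_per_capita`, and `average_gdp_per_capita` are illustrative.

```python
import pandas as pd

# Trimmed copy of the lab's DataFrame so this cell is self-contained.
df = pd.DataFrame({'U.S. State': ['California', 'Florida', 'Indiana', 'Texas', 'Pennsylvania'],
                   'Population (in millions)': [38, 21, 6.5, 28, 13],
                   'GDP ($ in billions)': [3700, 1070, 352, 1876, 726]})

gdp_usd = df['GDP ($ in billions)'] * 1e9        # billions of dollars -> dollars
people = df['Population (in millions)'] * 1e6    # millions of people -> people

gdp_per_capita = gdp_usd / people                # dollars per person, one value per state
average_gdp_per_capita = gdp_per_capita.mean()   # mean of the per-state values

print(gdp_per_capita)
print(average_gdp_per_capita)
```

Averaging the per-state ratios is one reading of the question. Dividing total GDP by total population is another reading and generally gives a different number.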

---

<center>

#### **Wait for Your Instructor to Continue**

---

<a name="p2"></a>

---
## **Part 2: Data Visualization Review**
---

**Run the cell below to load in the data**

In [None]:
url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vS9jPkeKJ8QUuAl-fFdg3nJPDP6vx1byvIBl4yW8UZZJ9QEscyALJp1eywKeAg7aAffwdKP63D9osF1/pub?gid=169291584&single=true&output=csv"
movie_df = pd.read_csv(url)

movie_df.drop_duplicates(inplace=True)

mean_runtime = movie_df['Runtime'].mean()
movie_df['Runtime'] = movie_df['Runtime'].fillna(mean_runtime)

movie_df = movie_df.rename(columns = {"Runtime": "Runtime (min)"})
movie_df = movie_df.astype({"Runtime (min)": "int64"})

movie_df.head()

| | Series_Title | Released_Year | Runtime (min) | Genre | IMDB_Rating | Overview | Director | Star1 | Star2 | Star3 | Star4 | No_of_Votes | Gross |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | 1994 | 142 | Drama | 9.3 | Two imprisoned men bond over a number of years... | Frank Darabont | Tim Robbins | Morgan Freeman | Bob Gunton | William Sadler | 2343110 | 28341469 |
| 1 | The Godfather | 1972 | 175 | Crime | 9.2 | An organized crime dynasty's aging patriarch t... | Francis Ford Coppola | Marlon Brando | Al Pacino | James Caan | Diane Keaton | 1620367 | 134966411 |
| 2 | The Dark Knight | 2008 | 152 | Action | 9.0 | When the menace known as the Joker wreaks havo... | Christopher Nolan | Christian Bale | Heath Ledger | Aaron Eckhart | Michael Caine | 2303232 | 534858444 |
| 3 | The Godfather: Part II | 1974 | 202 | Crime | 9.0 | The early life and career of Vito Corleone in ... | Francis Ford Coppola | Al Pacino | Robert De Niro | Robert Duvall | Diane Keaton | 1129952 | 57300000 |
| 4 | 12 Angry Men | 1957 | 96 | Crime | 9.0 | A jury holdout attempts to prevent a miscarria... | Sidney Lumet | Henry Fonda | Lee J. Cobb | Martin Balsam | John Fiedler | 689845 | 4360000 |


<a name="p2.1"></a>

---
### **Part 2.1: Scatter Plots**
---

#### **Problem #2.1.1**

**Together**, let's create a scatterplot using `Runtime (min)` as the x-axis value and `Gross` as the y-axis value.

Make sure to include a meaningful:
* `Title`: "Gross Money vs. Runtime"
* `X-axis`: "Runtime (min)"
* `Y-axis`: "Gross (USD)"

##### **Solution**
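A small sketch of the scatter plot pattern, using only the five rows shown in the `movie_df.head()` output so that it runs without downloading the full dataset. The name `sample_df` is illustrative, and the labels follow the axes the problem assigns: runtime on x, gross on y.

```python
import pandas as pd
import matplotlib.pyplot as plt

# The five rows from the movie_df.head() output, typed in by hand.
sample_df = pd.DataFrame({
    'Runtime (min)': [142, 175, 152, 202, 96],
    'Gross': [28341469, 134966411, 534858444, 57300000, 4360000],
})

# The first argument goes on the x-axis, the second on the y-axis.
points = plt.scatter(sample_df['Runtime (min)'], sample_df['Gross'])

plt.title('Gross Money vs. Runtime')
plt.xlabel('Runtime (min)')
plt.ylabel('Gross (USD)')
plt.show()
```

With the full dataset, the same calls would take `movie_df` in place of `sample_df`.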

---

#### **Now it's your turn! Try Problem #2.1.2 on your own.**

---

#### **Problem #2.1.2**

**Independently**, create a scatterplot using `Released_Year` as the x-axis value and `Runtime (min)` as the y-axis value.

Make sure to include a meaningful:
* `Title`: "Runtime vs. Released_Year"
* `X-axis`: "Year"
* `Y-axis`: "Runtime (min)"

##### **Solution**

---

<center>

#### **Wait for Your Instructor to Continue**

---

<a name="p2.2"></a>

---
### **Part 2.2: Line Plots**
---

#### **Problem #2.2.1**

**Together**, let's create a line plot using `Runtime (min)` as the x-axis value and `Gross` as the y-axis value.

Make sure to include a meaningful:
* Title, ex: `'Gross Money vs. Runtime'`.
* X-axis label including units `'min'`.
* Y-axis label including units `'USD'`.

<br>

**NOTE**: This is not going to be a particularly helpful graph (the scatter plot is a better choice), but we often will not know this ahead of time. A lot of EDA and visualization involves trying a number of things and seeing what is useful.

##### **Solution**

---

#### **Now it's your turn! Try Problem #2.2.2 on your own.**

---

#### **Problem #2.2.2**

**Independently**, create a line plot using `Released_Year` as the x-axis value and `Average Gross in Year` as the y-axis value.

Make sure to include a meaningful:
* Title, ex: `'Average Gross Money vs. Released Year'`.
* X-axis label.
* Y-axis label including units `'USD'`.

In [None]:
mean_gross = movie_df.groupby(# COMPLETE THIS LINE



##### **Solution**
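The group-then-average pattern behind this problem can be sketched on a tiny table. The values below are made up for illustration and are not from the lab's dataset; only the column names `Released_Year` and `Gross` come from the lab.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up toy values (NOT from the lab's dataset), used only to show the pattern.
toy_df = pd.DataFrame({
    'Released_Year': [2000, 2000, 2001, 2001, 2002],
    'Gross': [10, 30, 20, 40, 50],
})

# Group the rows by year, then average the 'Gross' column within each group.
# The result is a Series indexed by year.
mean_gross = toy_df.groupby('Released_Year')['Gross'].mean()

line, = plt.plot(mean_gross.index, mean_gross.values)
plt.title('Average Gross Money vs. Released Year')
plt.xlabel('Released Year')
plt.ylabel('Average Gross (USD)')
plt.show()
```

Because `groupby` sorts the group keys by default, the years arrive in ascending order, which is what a line plot needs.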

---

<center>

#### **Wait for Your Instructor to Continue**

---

<a name="p2.3"></a>

---
### **Part 2.3: Bar Plots**
---

#### **Problem #2.3.1**

**Together**, let's create a bar plot of the number of movies released per year.

Use the provided Series, `movies_per_year`, and make sure to include a meaningful:
* Title.
* X-axis label.
* Y-axis label.

In [None]:
movies_per_year = movie_df['Released_Year'].value_counts()

plt.bar(movies_per_year.index, # COMPLETE THIS CODE

##### **Solution**

---

#### **Now it's your turn! Try Problem #2.3.2 on your own.**

---

#### **Problem #2.3.2**

**Independently**, create a bar plot of the number of Dramas released per year.

Use the `movie_df` DataFrame and make sure to include a meaningful:
* Title.
* X-axis label.
* Y-axis label.

<br>

**Hint**: Recall that you can use `.loc[CRITERIA, :]` to find all data matching given criteria, and see the example in Problem #2.3.1 for finding the number of movies released per year.

In [None]:
# COMPLETE THIS CODE

##### **Solution**
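A sketch of the filter-then-count pattern from the hint, on made-up toy rows rather than the lab's dataset. It assumes each movie's `Genre` holds a single genre name, as in the rows shown earlier in the lab; the names `toy_df`, `dramas`, and `dramas_per_year` are illustrative.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up toy rows (NOT from the lab's dataset), used only to show the pattern.
toy_df = pd.DataFrame({
    'Released_Year': [2000, 2000, 2001, 2001, 2001, 2002],
    'Genre': ['Drama', 'Action', 'Drama', 'Drama', 'Crime', 'Action'],
})

# Keep only the rows whose genre is exactly 'Drama'.
dramas = toy_df.loc[toy_df['Genre'] == 'Drama', :]

# Count how many of the remaining rows fall in each year.
dramas_per_year = dramas['Released_Year'].value_counts()

bar_container = plt.bar(dramas_per_year.index, dramas_per_year.values)
plt.title('Number of Dramas Released per Year')
plt.xlabel('Released Year')
plt.ylabel('Number of Dramas')
plt.show()
```

Years with no dramas (2002 in this toy table) do not appear in `dramas_per_year` at all, so they get no bar.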

---

<center>

#### **Wait for Your Instructor to Continue**

---

<a name="p3"></a>

---
## **Part 3: [OPTIONAL] Improving Visualizations**
---

In this section, we will explore several ways to improve upon the visuals we learned to make above.

<a name="p3.1"></a>

---
### **Part 3.1: Improving Scatter Plots**
---

#### **Problem #3.1.1**

We are given average temperature values for the months of the year for two cities: `city_A` and `city_B`.

**Independently**, plot each city's average temperatures. We'll need to make two scatter plots.

Make `city_A` markers blue and `city_B` markers red. Add labels and a legend.

From the graph, which city is most likely located in the Northeast?

In [None]:
city_A_temps = [60,65,67, 70, 77, 84, 94, 101, 90, 82, 62]
city_B_temps = [-11, 14, 25, 32, 55, 73, 87, 92, 82, 66, 53]
months = np.arange(1,12)

# COMPLETE THE REST OF THE CODE

##### **Solution**
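One way to draw both cities on the same axes is to call `plt.scatter` twice, giving each call its own `color` and `label`. The sketch below uses the lab's data; the title and axis label text are illustrative, and no temperature unit is given because the lab does not state one.

```python
import numpy as np
import matplotlib.pyplot as plt

city_A_temps = [60, 65, 67, 70, 77, 84, 94, 101, 90, 82, 62]
city_B_temps = [-11, 14, 25, 32, 55, 73, 87, 92, 82, 66, 53]
months = np.arange(1, 12)  # 1 through 11, matching the 11 values in each list

# Each call adds one set of markers to the same axes.
points_A = plt.scatter(months, city_A_temps, color='blue', label='City A')
points_B = plt.scatter(months, city_B_temps, color='red', label='City B')

plt.title('Average Monthly Temperature by City')
plt.xlabel('Month')
plt.ylabel('Average Temperature')
legend = plt.legend()  # builds the legend from the label= arguments above
plt.show()
```

Changing the marker colors only requires changing the `color` arguments.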

#### **Problem #3.1.2**

**Independently**, adjust the plot so that `city_A` markers are black and `city_B` markers are green.

In [None]:
city_A_temps = [60,65,67, 70, 77, 84, 94, 101, 90, 82, 62]
city_B_temps = [-11, 14, 25, 32, 55, 73, 87, 92, 82, 66, 53]
months = np.arange(1,12)

# COMPLETE THE REST OF THE CODE

##### **Solution**

---

<center>

#### **Wait for Your Instructor to Continue**

---

<a name="p3.2"></a>

---
### **Part 3.2: Improving Line Plots**
---

#### **Problem #3.2.1**

**Independently**, create a line plot with the following features:


* A dashed line
* A grid




In [None]:
Year = [1920,1930,1940,1950,1960,1970,1980,1990,2000,2010]
Unemployment_Rate = [9.8,12,8,7.2,6.9,7,6.5,6.2,5.5,6.3]

# COMPLETE THE REST OF THE CODE

##### **Solution**
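A possible sketch using the lab's data: the `linestyle` argument controls the line pattern and `plt.grid` turns on the background grid. The title and axis label text are illustrative.

```python
import matplotlib.pyplot as plt

Year = [1920, 1930, 1940, 1950, 1960, 1970, 1980, 1990, 2000, 2010]
Unemployment_Rate = [9.8, 12, 8, 7.2, 6.9, 7, 6.5, 6.2, 5.5, 6.3]

# '--' requests a dashed line; '-.' would give a dash-dot pattern instead.
line, = plt.plot(Year, Unemployment_Rate, linestyle='--')
plt.grid(True)  # draw grid lines behind the data

plt.title('Unemployment Rate by Year')
plt.xlabel('Year')
plt.ylabel('Unemployment Rate')
plt.show()
```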

#### **Problem #3.2.2**

**Independently**, using the following data, create a line plot. In addition:
* Make that line dashed and dotted with `"-."`
* Add a grid to the background

In [None]:
# x axis values
x = [1,2,3]
# corresponding y axis values
y = [2,4,1]

# COMPLETE THE REST OF THE CODE

##### **Solution**

---

<center>

#### **Wait for Your Instructor to Continue**

---

<a name="p3.3"></a>

---
### **Part 3.3: Improving Bar Plots**
---

#### **Problem #3.3.1**

**Independently**, make a bar plot with each bar as a different color.

In [None]:
langs = ['English', 'French', 'Spanish', 'Chinese', 'Arabic']
students = [23,17,35,29,12]

# COMPLETE THE REST OF THE CODE

##### **Solution**

#### **Problem #3.3.2**

**Independently**, use the following data to create a simple bar plot. Make all of the bars blue except bar `E`; make bar `E` red.

In [None]:
height = [3, 12, 5, 18, 45]       # y
bars = ['A', 'B', 'C', 'D', 'E']  # x

# COMPLETE THE REST OF THE CODE

##### **Solution**
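`plt.bar` accepts a list of colors, one per bar, so a single bar can be singled out by building that list from the bar names. A sketch with the lab's data follows; the names `colors` and `bar_container` are illustrative.

```python
import matplotlib.pyplot as plt

height = [3, 12, 5, 18, 45]       # y
bars = ['A', 'B', 'C', 'D', 'E']  # x

# One color per bar: red for 'E', blue for every other bar.
colors = ['red' if name == 'E' else 'blue' for name in bars]

bar_container = plt.bar(bars, height, color=colors)
plt.show()
```

Passing a list with five different color names instead would give every bar its own color.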

---

<center>

#### **Wait for Your Instructor to Continue**

---

<a name="p3.4"></a>

---
### **Part 3.4: Enhancing Plot Aesthetics**
---

#### **Run the cell below to import the data for the following problems.**

This dataset contains information on U.S. agricultural exports in 2011.

In [None]:
url = 'https://raw.githubusercontent.com/plotly/datasets/master/2011_us_ag_exports.csv'
export_df = pd.read_csv(url)
export_df.head()

#### **Problem #3.4.1**

**Together**, let's compare beef export of different states using a bar plot. Adjust the size of the plot so that the graph and its labels are legible.



**NOTE**: To use a DataFrame for a graph, here is the syntax:
```
plt.bar(DF_NAME['x_variable'], DF_NAME['y_variable'])
```

##### **Solution**
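Plot size is set by creating the figure first with `plt.figure(figsize=(width, height))`, measured in inches. The sketch below uses made-up numbers in place of the real export data, and the column names `state` and `beef` are assumptions about the dataset that should be checked against `export_df.columns`.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up stand-in for export_df (NOT real export figures).
# The column names 'state' and 'beef' are assumed, not confirmed by the lab text.
toy_exports = pd.DataFrame({
    'state': ['State A', 'State B', 'State C'],
    'beef': [10.0, 25.5, 7.2],
})

fig = plt.figure(figsize=(12, 6))  # 12 inches wide, 6 inches tall
bar_container = plt.bar(toy_exports['state'], toy_exports['beef'])

plt.xticks(rotation=90)  # rotate the state names so they do not overlap
plt.title('Beef Exports by State')
plt.xlabel('State')
plt.ylabel('Beef Exports')
plt.show()
```

With many states on the x-axis, a wider figure and rotated tick labels are usually what makes the labels legible.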

---

#### **Now it's your turn! Try Problem #3.4.2 on your own.**

---

#### **Problem #3.4.2**

**Independently**, compare the export of corn from different states using a bar plot. Make sure you adjust the size of the plot.

##### **Solution**

---

<center>

#### **Wait for Your Instructor to Continue**

---

<a name="p4"></a>

---
## **Part 4: [ADDITIONAL PRACTICE] Improving Visualizations**
---

#### **Problem #4.1**

Using the following data, create a line plot.

**Hint:** Your graph should have three distinct lines corresponding to `y1`, `y2`, and `y3`.

In addition:
* Make sure each line is a different color
* Make `y1` a dashed line
* Add a grid to the background and make the background color black
* Add a legend

In [None]:
# x axis values
x = [1,2,3,4,5]
# corresponding y axis values
y1 = [2,4,6,8,10] # y = 2x
y2 = [0.5,1,1.5,2,2.5] # y = 0.5x
y3 = [1,4,9,16,25] # y = x^2

# COMPLETE THE REST OF THE CODE

##### **Solution**

#### **Problem #4.2**

Using the following data, create a scatter plot. In addition:
* Make the data points green
* Change the transparency to 0.6
* Make the x-label `Temperature (Fahrenheit)` and in the font `fantasy`
* Make the y-label `Number of People` and in the font `fantasy`
* Make the title `Number of People at the Beach` and in the font `fantasy`

In [None]:
# x axis values
x = [87, 94, 98, 102, 96, 90, 92, 93, 85, 82, 96, 80, 90, 91]
# corresponding y axis values
y = [204, 375, 522, 731, 439, 302, 317, 346, 268, 197, 649, 158, 327, 353]

# COMPLETE THE REST OF THE CODE

##### **Solution**

#### **Problem #4.3**

Using the following data, create a bar plot. In addition:
* Make each bar's color the same as the color name
* Make the x-label `Favorite Color` and with a font size of 12
* Make the y-label `Number of People` and with a font size of 12
* Make the title `Number of People vs Favorite Color` and with a font size of 20
* Make the bar width 0.6

In [None]:
# x axis values
x = [1,2,3,4,5]
# corresponding y axis values
y1 = [2,4,6,8,10]
y2 = [0.5,1,1.5,2,2.5]
y3 = [1,4,9,16,25]

# COMPLETE THE REST OF THE CODE

##### **Solution**

#### **Problem #4.4**

Using the following data, create a line plot. In addition:
* Make line y1 brown and dashed, and make line y2 pink
* Add a grid to the background
* Add a legend
* Make the title `X vs Y` in font `monospace` and in size 18


In [None]:
# x axis values
x = [1,2,3,4,5,6,7]
# corresponding y axis values
y1 = [9, 4, 6, 8, 22, 17, 13]
y2 = [3, 5, 8, 12, 17, 23, 30]

# COMPLETE THE REST OF THE CODE

##### **Solution**

#### **Problem #4.5**

Create a bar plot for the following data with the following:
* One bar showing the number of females in the dataset and another bar showing the number of males in the dataset.
* Bars labeled 'Female' and 'Male'.
* The y-axis labeled 'Number in Dataset' with extra large font.
* A title called "Number of Males and Females in the Dataset" with extra large font.

<br>

**Hint**: You will need to use pandas functions to get the count of males and females in the data frame.

In [None]:
url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vSa0metcKBFqn-MHLn05vVGWONMlzljcWa-xIM1wJPXIa5kbrmIzGqmWcMh8eKG_ntByF9qqn6Mx3MT/pub?gid=1052859518&single=true&output=csv'
df = pd.read_csv(url)
df.head()

# COMPLETE THE REST OF THE CODE

##### **Solution**

#### **Problem #4.6**

Create a *grouped* bar plot for the following data with the following:
* One bar graph showing the number of females with heart attacks and without in the dataset.
* This bar graph should be labeled 'Female' for the legend.
* Another bar graph showing the number of males with heart attacks and without in the dataset.
* This bar graph should be labeled 'Male' for the legend.
* Both bar graphs should be located on the x-axis and given a width to make the graph readable.
* The y-axis labeled 'Number in Dataset' with extra large font.
* A title called "Breakdown of Heart Attacks by Sex" with extra large font.

<br>

**Hint**: You will need to use pandas functions and comparisons to get the count of males and females with and without heart attacks in the data frame.

In [None]:
df_female = df[# COMPLETE THIS LINE
df_male = df[# COMPLETE THIS LINE

# COMPLETE THE REST OF THIS CODE

plt.xticks(ticks = [0, 1], labels = ['No Heart Attack', 'Heart Attack'], fontsize = 'x-large')

plt.# COMPLETE THIS LINE

##### **Solution**

#### **Comment on this Dataset**

This is an unfortunately common case of biased data, specifically *unbalanced data*, leading to potentially harmful results. We could attempt removing `'Sex'` as a feature to blind any ML models to the sex of the patient. However, bias often runs deeper than the most superficial variables and may be correlated with others in ways that humans and especially advanced ML algorithms can still pick up on. Consider some of the following ideas for improving on these results:

* Using statistical methods for balancing the data. For instance, upsampling and downsampling are common first approaches to tackling this problem.

* Find a dataset that is more balanced to begin with. In an ideal world, we would make sure that the data is balanced (representative) upon collection.
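As a rough illustration of upsampling and downsampling, the sketch below balances a made-up table with pandas' `sample` method. The rows, the `'HeartAttack'` column name, and the `'M'`/`'F'` codes are hypothetical; only the `'Sex'` column name comes from the lab.

```python
import pandas as pd

# Made-up, deliberately unbalanced toy data: 8 'M' rows and 2 'F' rows.
toy_df = pd.DataFrame({
    'Sex': ['M'] * 8 + ['F'] * 2,
    'HeartAttack': [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],  # hypothetical column
})

majority = toy_df[toy_df['Sex'] == 'M']
minority = toy_df[toy_df['Sex'] == 'F']

# Upsampling: draw minority rows WITH replacement until both groups are the same size.
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=0)
upsampled_df = pd.concat([majority, minority_upsampled])

# Downsampling: keep only a random subset of the majority rows.
majority_downsampled = majority.sample(n=len(minority), random_state=0)
downsampled_df = pd.concat([majority_downsampled, minority])

print(upsampled_df['Sex'].value_counts())
print(downsampled_df['Sex'].value_counts())
```

Upsampling repeats existing minority rows rather than adding new information, and downsampling throws data away, so neither is a substitute for collecting representative data.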

<a name="p5"></a>

---
## **Part 5: Modeling with sklearn**
---

### **Part 5.1: Linear Regression**


---

This dataset describes white wines through chemical characteristics such as acidity, pH, and alcohol content. The target variable (label) is a numeric quality rating for each wine.


In Part 5.1, we will implement a linear regression model aimed at predicting the quality rating of wines based on their chemical properties and characteristics.

#### **Step #1: Load in Data**

**Run the code below to load the data.**

In [None]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"
df = pd.read_csv(url, sep=';')

#### **Step #2: Choose your Variables**



In [None]:
inputs = df.drop("quality", axis = 1)
output = df[# COMPLETE THIS CODE

##### **Solution**

#### **Step #3: Split your Data**


In [None]:
X_train, X_test, y_train, y_test = # COMPLETE THIS CODE

##### **Solution**

#### **Step #4: Import an ML Algorithm**




##### **Solution**

#### **Step #5: Initialize the Model**


In [None]:
model = # COMPLETE THIS LINE

##### **Solution**

#### **Step #6: Fit, Test, and Visualize**


In [None]:
model.fit(X_train, # COMPLETE THIS LINE

In [None]:
predictions = # COMPLETE THIS LINE

In [None]:
plt.figure(figsize=(8, 8))
plt.scatter(y_test, predictions)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color = 'black', label="Correct prediction")


plt.xlabel('True Quality', fontsize = 'x-large')
plt.ylabel('Predicted Quality', fontsize = 'x-large')
plt.title("Real vs predicted wine quality", fontsize = 'x-large')
plt.legend()

plt.show()

##### **Solution**

#### **Step #7: Evaluate**

Let's evaluate this model and put it to the test! Specifically, evaluate the model using our standard regression metrics: $R^2$, MSE, and MAE.


###### **Solution**
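The split, fit, predict, and evaluate steps can be sketched end to end on synthetic data. This is a stand-in for illustration, not the wine dataset: `make_regression` generates random features and targets, and the `test_size`, `noise`, and `random_state` values are arbitrary choices.

```python
from sklearn import model_selection
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Synthetic stand-in data (NOT the wine dataset): 200 samples, 3 features.
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=42)

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)           # learn coefficients from the training rows
predictions = model.predict(X_test)   # predict on rows the model has not seen

r2 = r2_score(y_test, predictions)              # 1.0 would be a perfect fit
mse = mean_squared_error(y_test, predictions)   # average squared error
mae = mean_absolute_error(y_test, predictions)  # average absolute error

print("R^2:", r2)
print("MSE:", mse)
print("MAE:", mae)
```

All three metrics take the true values first and the predictions second. MSE is in squared units of the target, while MAE is in the target's own units.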

#### **Step #8: Apply your Model**

You are provided with data from two new wines, and you want to assess the predicted quality ratings for each of them. The goal is to determine whether either wine is likely to have a higher quality rating based on the model's predictions.

Here is the data for the two wines:

**Wine 1:**
* Fixed Acidity = 12.5
* Volatile Acidity = 0.3
* Citric Acid = 0.6
* Residual Sugar = 1.2
* Chlorides = 0.07
* Free Sulfur Dioxide = 15.0
* Total Sulfur Dioxide = 50.0
* Density = 0.998
* pH = 3.2
* Sulphates = 0.68
* Alcohol Content = 11.5

**Wine 2:**

* Fixed Acidity = 13.2
* Volatile Acidity = 0.28
* Citric Acid = 0.45
* Residual Sugar = 2.0
* Chlorides = 0.09
* Free Sulfur Dioxide = 12.0
* Total Sulfur Dioxide = 65.0
* Density = 0.995
* pH = 3.3
* Sulphates = 0.55
* Alcohol Content = 12.0

You will use your linear regression model to predict the quality ratings for these wines and assess their relative quality based on the predictions.

##### **1. Predict the quality of Wine 1**


###### **Solution**
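A new sample is usually passed to `predict` as a one-row DataFrame whose columns match the training inputs in name and order. The sketch below only builds that row for Wine 1. The column names are an assumption about the CSV header and should be checked against `inputs.columns`; `feature_names` and `wine_1` are illustrative names.

```python
import pandas as pd

# Column names assumed to match the wine-quality CSV header
# (every column except "quality"); verify against inputs.columns.
feature_names = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol",
]

# Wine 1's values from the list above, in the same order as feature_names.
wine_1 = pd.DataFrame(
    [[12.5, 0.3, 0.6, 1.2, 0.07, 15.0, 50.0, 0.998, 3.2, 0.68, 11.5]],
    columns=feature_names,
)

print(wine_1.shape)  # (rows, columns)

# With the fitted model from Step #6, the prediction would then be:
# wine_1_quality = model.predict(wine_1)
```

The double brackets matter: the outer list holds rows and the inner list holds one row's values.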

##### **2. Predict the quality of Wine 2**

###### **Solution**

### **Part 5.2: K-Nearest Neighbors**


---

This dataset contains crucial information related to breast cancer, including various features such as mean radius, mean texture, and mean smoothness. The target variable (label) indicates the diagnosis, distinguishing between malignant and benign cases.

In Part 5.2, we will implement a K-Nearest Neighbors (KNN) model aimed at predicting the diagnosis of breast cancer samples. The goal is to classify new samples as either malignant or benign based on their feature characteristics.

#### **Step #1: Load in Data**

**Run the code below to load the data.**

In [None]:
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
selected_features = ["mean radius", "mean texture", "mean perimeter", "mean area", "mean smoothness", "mean compactness", "mean concavity", "mean concave points", "mean symmetry", "mean fractal dimension"]
df = pd.DataFrame(data.data, columns=data.feature_names)
df = df[selected_features]
df['Target'] = data.target

#### **Step #2: Choose your Variables**



In [None]:
inputs = # COMPLETE THIS CODE
output = # COMPLETE THIS CODE

##### **Solution**

#### **Step #3: Split your Data**


In [None]:
X_train, X_test, y_train, y_test = # COMPLETE THIS CODE

##### **Solution**

#### **Step #4: Import an ML Algorithm**




##### **Solution**

#### **Step #5: Initialize the Model**


In [None]:
model = # COMPLETE THIS LINE

##### **Solution**

#### **Step #6: Fit and Test**


In [None]:
model.fit(X_train, # COMPLETE THIS LINE

In [None]:
predictions = # COMPLETE THIS LINE

##### **Solution**

#### **Step #7: Evaluate**

Let's evaluate this model and put it to the test! Specifically, use the accuracy score to get a simple overall picture of your model's performance, and the confusion matrix to get a more nuanced view of where the model is performing the best and worst.


In [None]:
print(accuracy_score(# COMPLETE THIS CODE

In [None]:
cm = confusion_matrix(# COMPLETE THIS CODE
disp = ConfusionMatrixDisplay(# COMPLETE THIS CODE
disp.plot()
plt.show()

###### **Solution**
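Because the breast cancer dataset ships with sklearn, the whole KNN workflow can be sketched without a download. Treat this as one possible version rather than the expected answer: `n_neighbors=5`, `test_size=0.2`, and `random_state=42` are arbitrary choices, and the inputs and output are built directly instead of through a `Target` column.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay

data = load_breast_cancer()
selected_features = ["mean radius", "mean texture", "mean perimeter", "mean area",
                     "mean smoothness", "mean compactness", "mean concavity",
                     "mean concave points", "mean symmetry", "mean fractal dimension"]

inputs = pd.DataFrame(data.data, columns=data.feature_names)[selected_features]
output = data.target  # 0 and 1, in the order given by data.target_names

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    inputs, output, test_size=0.2, random_state=42)

model = KNeighborsClassifier(n_neighbors=5)  # k = 5 is an arbitrary starting point
model.fit(X_train, y_train)
predictions = model.predict(X_test)

accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, predictions)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=data.target_names)
disp.plot()
plt.show()
```

Printing `data.target_names` shows which diagnosis each number stands for: in this dataset 0 is malignant and 1 is benign.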

#### **Step #8: Apply your Model**

You are provided with data from two new breast cancer samples, and you want to assess the predicted class labels (Malignant or Benign) for each of them. The goal is to determine whether either sample is likely to be malignant or benign based on the model's predictions.

Here is the data for the two samples:

**Sample 1:**

* Mean Radius = 12.5
* Mean Texture = 18.2
* Mean Perimeter = 80.3
* Mean Area = 490.2
* Mean Smoothness = 0.09
* Mean Compactness = 0.08
* Mean Concavity = 0.05
* Mean Concave Points = 0.03
* Mean Symmetry = 0.18
* Mean Fractal Dimension = 0.06

**Sample 2:**

* Mean Radius = 14.3
* Mean Texture = 20.8
* Mean Perimeter = 92.6
* Mean Area = 650.9
* Mean Smoothness = 0.1
* Mean Compactness = 0.12
* Mean Concavity = 0.09
* Mean Concave Points = 0.05
* Mean Symmetry = 0.2
* Mean Fractal Dimension = 0.07

You will use your KNN (k-nearest neighbors) model to predict the class labels for these samples and assess their relative likelihood of being malignant or benign based on the predictions.

##### **1. Predict the diagnosis of Sample 1**


In [None]:
sample_1_features = pd.DataFrame([[# COMPLETE THIS CODE

prediction_sample_1 = model.predict(# COMPLETE THIS CODE

print("Predicted label for Sample 1:", "Benign" if prediction_sample_1[0] == 1 else "Malignant")

###### **Solution**

##### **2. Predict the diagnosis of Sample 2**

In [None]:
sample_2_features = pd.DataFrame([[# COMPLETE THIS CODE

prediction_sample_2 = model.predict(# COMPLETE THIS CODE

print("Predicted label for Sample 2:", "Benign" if prediction_sample_2[0] == 1 else "Malignant")

###### **Solution**

### **Part 5.3: Logistic Regression**


---

The Pima Indians Diabetes dataset is a collection of medical records related to diabetes diagnoses among women of Pima Indian heritage. It comprises various attributes, including the number of times pregnant, plasma glucose concentration, diastolic blood pressure, triceps skinfold thickness, and several others. The target variable (label) indicates whether an individual has diabetes.

In Part 5.3, we will develop a Logistic Regression model to predict diabetes diagnoses based on these features. The primary objective is to classify new individuals as either having diabetes or not, based on the provided attribute values.

#### **Step #1: Load in Data**

**Run the code below to load the data.**

In [None]:
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"

column_names = [
    "Number of times pregnant",
    "Plasma glucose concentration",
    "Diastolic blood pressure",
    "Triceps skinfold thickness",
    "2-Hour serum insulin",
    "BMI",
    "Diabetes pedigree function",
    "Age",
    "Class"
]

df = pd.read_csv(url, names=column_names)

#### **Step #2: Choose your Variables**



In [None]:
inputs = # COMPLETE THIS CODE
output = # COMPLETE THIS CODE

##### **Solution**

#### **Step #3: Split your Data**


In [None]:
X_train, X_test, y_train, y_test = # COMPLETE THIS CODE

##### **Solution**

#### **Step #4: Import an ML Algorithm**




##### **Solution**

#### **Step #5: Initialize the Model**


In [None]:
model = # COMPLETE THIS LINE

##### **Solution**

#### **Step #6: Fit and Test**


In [None]:
model.fit(X_train, # COMPLETE THIS LINE

In [None]:
predictions = # COMPLETE THIS LINE

##### **Solution**

#### **Step #7: Evaluate**

Let's evaluate this model and put it to the test! Specifically, use the classification report to summarize precision, recall, and F1-score for each class, and the confusion matrix to get a more nuanced view of where the model is performing the best and worst.


In [None]:
report = classification_report(# COMPLETE THIS LINE
print('Classification report ' + str(report))

In [None]:
cm = confusion_matrix(# COMPLETE THIS CODE
disp = ConfusionMatrixDisplay(# COMPLETE THIS CODE
disp.plot()
plt.show()

###### **Solution**
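The same workflow for logistic regression, sketched on synthetic data from `make_classification` rather than the diabetes records. The eight features mirror the number of input columns in the lab, but the values are random. `max_iter=1000` and the split settings are arbitrary choices.

```python
from sklearn import model_selection
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Synthetic stand-in data (NOT the diabetes dataset): 300 samples, 8 features.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)  # extra iterations help the solver converge
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Text table with precision, recall, F1-score, and support for each class.
report = classification_report(y_test, predictions)
print('Classification report\n' + report)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, predictions)
print(cm)

# predict_proba returns one row per sample: [P(class 0), P(class 1)].
first_sample_probabilities = model.predict_proba(X_test[:1])
print(first_sample_probabilities)
```

`predict` returns only the most likely class, while `predict_proba` gives the estimated probability of each class, which is useful when comparing how likely two individuals are to belong to the same class.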

#### **Step #8: Apply your Model**

You are provided with data from two new Pima Indian individuals, and you want to assess the predicted class labels (Diabetes or No Diabetes) for each of them. The goal is to determine whether either individual is likely to have diabetes based on the model's predictions.

Here is the data for the two individuals:

**Individual 1:**

* Number of times pregnant: 2
* Plasma glucose concentration: 85
* Diastolic blood pressure: 66
* Triceps skinfold thickness: 29
* 2-Hour serum insulin: 0
* BMI: 26.6
* Diabetes pedigree function: 0.351
* Age: 31

**Individual 2:**

* Number of times pregnant: 8
* Plasma glucose concentration: 183
* Diastolic blood pressure: 64
* Triceps skinfold thickness: 0
* 2-Hour serum insulin: 0
* BMI: 23.3
* Diabetes pedigree function: 0.672
* Age: 32

You will use your logistic regression model to predict the class labels for these individuals and assess their relative likelihood of having diabetes based on the predictions.

##### **1. Predict the diagnosis of Individual 1**


In [None]:
individual_1_features = pd.DataFrame([[# COMPLETE THIS CODE

prediction_individual_1 = model.predict(# COMPLETE THIS CODE

print("Predicted label for Individual 1:", "Diabetes" if prediction_individual_1[0] == 1 else "No Diabetes")

###### **Solution**

##### **2. Predict the diagnosis of Individual 2**

In [None]:
individual_2_features = pd.DataFrame([[# COMPLETE THIS CODE

prediction_individual_2 = model.predict(# COMPLETE THIS CODE

print("Predicted label for Individual 2:", "Diabetes" if prediction_individual_2[0] == 1 else "No Diabetes")

###### **Solution**

---

# End of Notebook

© 2023 The Coding School, All rights reserved