<p align="center" draggable="false"><img src="https://user-images.githubusercontent.com/37101144/161836199-fdb0219d-0361-4988-bf26-48b0fad160a3.png" 
     width="200px"
     height="auto"/>
</p>

# <h1 align="center" id="heading">Machine Learning Engineering Onramp</h1>
# <h2 align="center" id="heading">MLE Basic Toolkit 🧰</h2>


## 📚 Learning Objectives

In this notebook, you will learn the basics of some of the most widely used libraries in Python for Machine Learning:
* `pandas` - for data manipulation and exploratory data analysis
* `scikit-learn` (aka `sklearn`) - for predictive data analysis and machine learning
* `matplotlib` and `seaborn` - for data visualization

## Pandas 🐼

[Pandas](https://pandas.pydata.org/) is a powerful, versatile, and easy-to-use tool for data manipulation and analysis.  It is the ML Engineer's Swiss Army knife for tabular data.

### Let's start off by creating a DataFrame from scratch!

A `DataFrame` is a data structure that is used to organize data into a 2-dimensional table of rows and columns.  You can think of a DataFrame as a spreadsheet or SQL table.

First we'll import `pandas` using the conventional abbreviation `pd`, which will not only save us some keystrokes, but also make our code less verbose.  While we can abbreviate any package with any notation, it is best to follow the conventions set by the authors of the package.  

In [1]:
import pandas as pd

Now let's create a DataFrame using the following steps:
1. Create a couple of lists
2. Create a dictionary mapping for those lists
3. Create a DataFrame

In [2]:
names = ["Greg", "Sina", "Milica", "Chris", "Michael"]
nums = [1, 2, 3, 4, 5]

data = {"names":names, "nums":nums}

df = pd.DataFrame(data)

print(df)

     names  nums
0     Greg     1
1     Sina     2
2   Milica     3
3    Chris     4
4  Michael     5
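
The same kind of table can also be built row by row instead of column by column. A minimal sketch, assuming each record is a dictionary keyed by column name (the names `records` and `df_rows` are illustrative and not part of this notebook):

```python
import pandas as pd

# Each dictionary is one row; the keys become the column names.
records = [
    {"names": "Greg", "nums": 1},
    {"names": "Sina", "nums": 2},
    {"names": "Milica", "nums": 3},
]

df_rows = pd.DataFrame(records)

print(df_rows.columns.tolist())
print(df_rows.shape)
```

The column-oriented dictionary is usually more convenient when the data already lives in lists, while the row-oriented form suits records that arrive one at a time.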


We can also easily add data by creating new rows or columns.

In [3]:
# Add a new column for roles
roles = ["Head of Product", "Instructor", "Instructor", "Instructor", "Instructor"]
df["roles"] = roles

# Append a new row using a list
df.loc[len(df)] = ["Ali", 6, "Instructor"] 
print(df)

     names  nums            roles
0     Greg     1  Head of Product
1     Sina     2       Instructor
2   Milica     3       Instructor
3    Chris     4       Instructor
4  Michael     5       Instructor
5      Ali     6       Instructor


In [4]:
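# Append a new row using a pd.Series
# (DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so this stays commented out)
# new = ["Bruno", 7, "Instructor"]
# df = df.append(pd.Series(new, index=df.columns[:len(new)]), ignore_index=True)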

# Append a new row using pd.concat 
new_row = pd.DataFrame({"names": "Bruno", "nums": 7, "roles": "Instructor"}, index=[0])
df = pd.concat([df, new_row], ignore_index=True)
print(df)

     names  nums            roles
0     Greg     1  Head of Product
1     Sina     2       Instructor
2   Milica     3       Instructor
3    Chris     4       Instructor
4  Michael     5       Instructor
5      Ali     6       Instructor
6    Bruno     7       Instructor


What happens if we specify an index that already exists in the DataFrame?

In [5]:
df.loc[6] = ["Milan", 8, "Instructor"] 
df

     names  nums            roles
0     Greg     1  Head of Product
1     Sina     2       Instructor
2   Milica     3       Instructor
3    Chris     4       Instructor
4  Michael     5       Instructor
5      Ali     6       Instructor
6    Milan     8       Instructor
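
Because `df.loc[label] = ...` silently replaces a row whose label already exists, it can help to check the index first. A minimal sketch, assuming a small stand-in DataFrame; the helper `add_row_safely` is hypothetical and not a pandas function:

```python
import pandas as pd

df_demo = pd.DataFrame({"names": ["Ali", "Bruno"], "nums": [6, 7]})


def add_row_safely(frame, label, row):
    """Add `row` at index `label`, refusing to overwrite an existing row."""
    if label in frame.index:
        raise KeyError(f"index {label} already exists; refusing to overwrite")
    frame.loc[label] = row
    return frame


# Label 2 is free, so this appends a new row.
add_row_safely(df_demo, 2, ["Milan", 8])

# Label 1 already belongs to Bruno, so the guard raises instead of overwriting.
try:
    add_row_safely(df_demo, 1, ["Milan", 8])
except KeyError as err:
    print(err)

print(df_demo)
```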


Oops, we have overwritten Bruno!  Let's add him back, but this time let's insert him at a specific position.  There isn't a great way to do this in Pandas, but we can use `numpy` to help us out!

In [6]:
import numpy as np
print(df.values)
df = pd.DataFrame(np.insert(df.values, 6, values=["Bruno", 7, "Instructor"], axis=0), columns=df.columns)
df

[['Greg' 1 'Head of Product']
 ['Sina' 2 'Instructor']
 ['Milica' 3 'Instructor']
 ['Chris' 4 'Instructor']
 ['Michael' 5 'Instructor']
 ['Ali' 6 'Instructor']
 ['Milan' 8 'Instructor']]


     names  nums            roles
0     Greg     1  Head of Product
1     Sina     2       Instructor
2   Milica     3       Instructor
3    Chris     4       Instructor
4  Michael     5       Instructor
5      Ali     6       Instructor
6    Bruno     7       Instructor
7    Milan     8       Instructor


In [7]:
print(type(df.values))

<class 'numpy.ndarray'>
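
One side effect of going through `df.values` is that a table with mixed column types becomes a single NumPy array of dtype `object`, so the rebuilt DataFrame loses its per-column dtypes. As an alternative, here is a pandas-only sketch that slices the DataFrame and stitches it back together with `pd.concat`; the stand-in data and the names `df_demo`, `position`, and `df_inserted` are assumptions for illustration:

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"names": ["Ali", "Milan"], "nums": [6, 8], "roles": ["Instructor", "Instructor"]}
)
new_row = pd.DataFrame({"names": ["Bruno"], "nums": [7], "roles": ["Instructor"]})

position = 1  # insert before the row currently at position 1

# Stitch together: rows before the position, the new row, rows after it.
df_inserted = pd.concat(
    [df_demo.iloc[:position], new_row, df_demo.iloc[position:]],
    ignore_index=True,
)

print(df_inserted)
print(df_inserted.dtypes)
```

With this approach the `nums` column stays an integer column rather than becoming `object`.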


## Diving Deeper 🤿 
Now that we can create our own DataFrame from scratch, let's take another dataset, clean it up, and analyze it!  We will load the [diabetes dataset](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html) that comes bundled with `sklearn`.  `sklearn` ships with several small [datasets](https://scikit-learn.org/stable/datasets/toy_dataset.html) that are great for practicing data exploration and ML techniques!

In [8]:
from sklearn.datasets import load_diabetes
dfX, Y = load_diabetes(return_X_y=True, as_frame=True)

dfX


In this dataset, we are given 10 variables of interest to predict diabetes progression after one year.  Notice that columns 5-10 (`s1` through `s6`) are not given meaningful names, so let's rename these columns using a dictionary mapping.

In [None]:
cols = {"s1":"tc", "s2":"ldl", "s3":"hdl", "s4":"tch", "s5":"ltg", "s6":"glu"}
dfX = dfX.rename(columns=cols)
dfX

Although our X values are returned to us as a DataFrame, our Y values are returned as a Pandas Series ([pd.Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html)).  Let's create a DataFrame for it as well!

In [None]:
dfY = pd.DataFrame({"disease_progression": Y})
dfY
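
Another way to turn a Series into a one-column DataFrame is the `Series.to_frame` method. A minimal sketch, using a stand-in Series with placeholder values rather than the real target:

```python
import pandas as pd

# Stand-in for the target Series; the values are placeholders.
y_demo = pd.Series([10.0, 20.0, 30.0], name="target")

# to_frame converts the Series to a one-column DataFrame and can rename it.
df_y_demo = y_demo.to_frame(name="disease_progression")

print(type(df_y_demo))
print(df_y_demo.columns.tolist())
```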

## Exploratory Data Analysis 🗺️ 🧭 ⛏️
Now let's begin exploring the data!  We can use the `shape` attribute in Pandas to learn the dimensions of our DataFrames.

In [None]:
print(dfX.shape)

We can see that `dfX` contains 442 rows and 10 columns.  Let's make sure that our `dfY` DataFrame has the same number of rows.  Give it a try in the cell below!

In [None]:
# YOUR CODE GOES HERE!

Now that we know that our DataFrames are the same length, let's add the dependent variable (Y) DataFrame to our DataFrame with our independent variables (X).  

We can also use the `head` method to return the first rows of the DataFrame.  This allows us to peek 👀 at the data rather than return the whole DataFrame, which can be helpful for large datasets.

In [None]:
dfX["disease_prog"] = dfY
dfX.head(5)

Let's see if we have any null values that we need to deal with.

In [None]:
dfX.isnull().sum()
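
To see what this check reports when values are missing, here is a small sketch on a toy DataFrame with a few `NaN` entries. The data is made up, and the two remedies at the end are common options rather than steps this notebook requires:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(
    {"bmi": [0.06, np.nan, -0.01, 0.04], "bp": [0.02, -0.03, np.nan, np.nan]}
)

# isnull() marks missing cells as True; summing counts them per column.
null_counts = df_demo.isnull().sum()
print(null_counts)

# Option 1: drop every row that has at least one missing value.
df_dropped = df_demo.dropna()

# Option 2: fill each missing value with the mean of its column.
df_filled = df_demo.fillna(df_demo.mean())

print(len(df_dropped), int(df_filled.isnull().sum().sum()))
```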


It doesn't look like we have any nulls to deal with.  Let's use the `describe` method to give us some descriptive statistics about the dataset!

In [None]:
dfX.describe()

## 📈 Visualize the Data

Let's visualize the data with the help of the [matplotlib](https://matplotlib.org/) and [seaborn](https://seaborn.pydata.org/) libraries.  `Matplotlib` is a comprehensive data visualization library that can be used for customized static plots as well as dynamic and interactive ones.  `Seaborn` is a high-level library for common statistical visualizations that runs `matplotlib` under the hood.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

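# displot is a figure-level function that creates its own figure,
# so the size is set with height (inches) and aspect (width / height)
sns.displot(dfX["disease_prog"], height=7, aspect=10 / 7)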

Let's also have a look at the correlations among our variables using a heatmap.

In [None]:
corr = dfX.corr()

#figure size
plt.figure(figsize=(10, 7))

sns.heatmap(corr, annot=True, fmt='.2f')

Since the correlation matrix is symmetric, half of the heatmap above is redundant.  We can hide the upper triangle with a mask, which gives us a cleaner, less busy correlogram.

In [None]:
#figure size
plt.figure(figsize=(10, 7))

# True marks the cells to hide: the upper triangle, including the diagonal
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, annot=True, fmt='.2f')
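
To make the mask less mysterious, here is a small sketch that prints what `np.triu` produces for a 3x3 matrix. Seaborn hides the cells where the mask is `True`. The variable names are illustrative, and the `k=1` variant is an optional tweak, not something this notebook uses:

```python
import numpy as np

# A 3x3 boolean mask: True marks cells that the heatmap will hide.
mask_demo = np.triu(np.ones((3, 3), dtype=bool))
print(mask_demo)

# k=1 starts one step above the diagonal, so the diagonal stays visible.
mask_keep_diag = np.triu(np.ones((3, 3), dtype=bool), k=1)
print(mask_keep_diag)
```

The default mask also hides the diagonal, which is harmless here because every variable correlates perfectly with itself.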

We see that `disease_prog` is moderately correlated with both `ltg` (0.57) and `bmi` (0.59).  It is also worth noting that there is a strong correlation between `ldl` and `tc` (0.9).
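
For intuition about where these numbers come from, the sketch below computes a Pearson correlation coefficient by hand and compares it with `DataFrame.corr()`, which uses the Pearson method by default. The data is synthetic and the variable names are illustrative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.6 * x + rng.normal(scale=0.8, size=200)  # synthetic, loosely related data

# Pearson r = sum of co-deviations / sqrt(product of summed squared deviations)
x_c = x - x.mean()
y_c = y - y.mean()
r_manual = (x_c * y_c).sum() / np.sqrt((x_c**2).sum() * (y_c**2).sum())

# The same coefficient as computed by pandas.
r_pandas = pd.DataFrame({"x": x, "y": y}).corr().loc["x", "y"]

print(f"manual: {r_manual:.4f}, pandas: {r_pandas:.4f}")
```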

Let's see if we can note any relationship between these variables by looking at some pair plots.

In [None]:
# pairplot is a figure-level function that creates its own figure,
# so the size of each panel is set with height (inches)
sns.pairplot(dfX[["bmi", "ltg", "disease_prog"]], height=3)

The relationships above look approximately linear.  We can also plot a regression plot to evaluate the linear relationship more closely.


In [None]:
plt.figure(figsize=(10, 7))

#regression between bmi and progression
sns.regplot(data=dfX, x="bmi", y="disease_prog",line_kws={"color": "red"})

#labeling
plt.title("BMI VS Disease Progression")
plt.xlabel("BMI")
plt.ylabel("Disease Progression")

In [None]:
# Set the figure size
plt.figure(figsize=(10, 7))

#regression between ltg and progression
sns.regplot(data=dfX, x="ltg", y="disease_prog",line_kws={"color": "red"})

#labeling
plt.title("LTG VS Disease Progression")
plt.xlabel("LTG")
plt.ylabel("Disease Progression")

While there does appear to be a linear relationship between `disease_prog` and both `bmi` and `ltg`, there is also a lot of dispersion.  This makes sense, since we only saw moderate correlations.

Let's drop our Y value, `disease_prog`, from the DataFrame so that we can perform a regression analysis using sklearn!

In [None]:
dfX = dfX.drop(["disease_prog"], axis=1)
dfX

## Split the data ➗

Now let's split our dataset into `train` and `test` datasets - this will allow our model to learn from the `train` data and we can evaluate the performance of the model on the `test` data that it hasn't seen before.  Since this is a small dataset, we will use 90% of the data for training and 10% for testing.  We will also set a `random_state` for reproducibility!

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test  = train_test_split(dfX, dfY, test_size=0.1,random_state=42)
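
Conceptually, a random split shuffles the row positions and cuts them into two groups. The sketch below illustrates that idea with NumPy only. It is a simplified illustration, not scikit-learn's implementation, and the same seed will not reproduce the rows that `train_test_split` selects. Rounding the test size up is an assumption made for this sketch:

```python
import numpy as np

n_samples = 442  # number of rows in the diabetes data
test_size = 0.1

rng = np.random.default_rng(42)  # seeded for reproducibility
shuffled = rng.permutation(n_samples)

# Assumption for this sketch: round the test set size up.
n_test = int(np.ceil(test_size * n_samples))
test_idx = shuffled[:n_test]
train_idx = shuffled[n_test:]

print(len(train_idx), len(test_idx))
```

Every row lands in exactly one of the two groups, which is what keeps the test set unseen during training.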

## Making Predictions 🔮 

Now we can fit the model to the training data by calling the `fit` (dot-fit) method, and then use `predict` to generate predictions for both the train and test sets.

In [None]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression().fit(X_train, y_train)

y_test_pred = lr.predict(X_test)
y_train_pred = lr.predict(X_train)
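
To check that `fit` and `predict` behave as expected, it can help to try them on data where the answer is known. A minimal sketch, assuming noise-free synthetic data that follows y = 2x + 1 (the names `X_demo`, `y_demo`, and `model` are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data following y = 2x + 1.
X_demo = np.arange(10, dtype=float).reshape(-1, 1)  # 2-D: (n_samples, n_features)
y_demo = 2.0 * X_demo.ravel() + 1.0

model = LinearRegression().fit(X_demo, y_demo)

# The learned slope and intercept should be close to 2 and 1.
print(model.coef_, model.intercept_)

# Predict for an unseen input, x = 10.
print(model.predict(np.array([[10.0]])))
```

Note that the features must be 2-dimensional, with one row per sample and one column per feature, which is why a DataFrame works directly as `X`.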

## Evaluate the Model 📏 

Now that we have fit the model and made some predictions, we can evaluate our model using the [coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination) (R^2 aka R-squared).  This tells us how much of the variance in the dependent variable is explained by the model.

In [None]:
#intercept
print(f"the intercept is:{lr.intercept_[0]: .2f}")

#slopes
print(f"the slopes are:{lr.coef_}")

#Score model using R^2
print(f"R2 on train set:{lr.score(X_train, y_train): .2f}") #train set
print(f"R2 on test set:{lr.score(X_test, y_test): .2f}") #test set
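
The `score` method of `LinearRegression` returns R^2. To show what that number measures, the sketch below computes it by hand as 1 - SS_res / SS_tot on synthetic data, using `np.polyfit` for a one-variable least-squares line. The data and variable names are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y_true = 3.0 * x + rng.normal(size=100)  # synthetic data with noise

# Least-squares line via np.polyfit (degree 1), then predictions.
slope, intercept = np.polyfit(x, y_true, deg=1)
y_pred = slope * x + intercept

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"R^2: {r2:.3f}")
```

For a single predictor with an intercept, this value equals the square of the Pearson correlation between `x` and `y_true`, which ties the R^2 here back to the correlations in the heatmap.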

The R2 is only around 0.5, which demonstrates a moderate fit. While this might seem low, it can be perfectly acceptable in some cases. This result is not totally unexpected as we only had moderate correlations between our independent and dependent variables.  

## Conclusions 🧑‍🏫 

In this exercise, we went through many of the basic functions of `pandas`, performed an exploratory data analysis (EDA), made data visualizations using `matplotlib` and `seaborn`, and went through the ML workflow in `sklearn`.  Now that we have covered many of the basics of some of the core packages that we will be using in this course, you've got a firm foundation to build on!