# Python Programming

**Chapter 7 : Basic Data Science with Python** 

Python is a fun language to learn, and easy to pick up even if you are new to programming. In fact, Python is often easier to pick up if you have no prior programming experience at all. Python is a high-level programming language, targeted at students and professionals from diverse backgrounds.

In this chapter, we will cover
- Essential Libraries
- Case Study : Linear Regression
- Case Study : Classification
- Case Study : Clustering

**License Declaration** : Following the lead from the inspirations for this material, and the *spirit* of Python education and development, all modules of this work are licensed under the Creative Commons Attribution 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0/.

---

## Essential Libraries

Let us begin by importing the essential Python Libraries.    
You may install any library using `conda install <library>`.    
Most of the libraries come by default with the Anaconda platform.

> NumPy : Library for Numeric Computations in Python  
> Pandas : Library for Data Acquisition and Preparation  
> Matplotlib : Low-level library for Data Visualization  
> Seaborn : Higher-level library for Data Visualization  

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics

We will also need the most common Python libraries for (basic) Machine Learning.      
Scikit-Learn (`sklearn`) will be our de facto Machine Learning library in Python.   

**Linear Regression**
> `LinearRegression` model from `sklearn.linear_model` : Our main model for Regression   
> `mean_squared_error` metric from `sklearn.metrics` : Performance metric for Regression       

**Classification Tree**
> `DecisionTreeClassifier` model from `sklearn.tree` : Our main model for Classification   
> `plot_tree` function from `sklearn.tree` : Function to clearly visualize a Classification Tree   
> `confusion_matrix` metric from `sklearn.metrics` : Performance metric for Classification     

**K-Means Clustering**
> `KMeans` model from `sklearn.cluster` : Our main model for Clustering   

*Common Functionality*
> `train_test_split` function from `sklearn.model_selection` : Random Train-Test splits     

In [None]:
# Import essential models and functions from sklearn

# Linear Regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Classification Tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree
from sklearn.metrics import confusion_matrix

# K-Means Clustering
from sklearn.cluster import KMeans

# Common Functionality
from sklearn.model_selection import train_test_split

---

## Case Study : Linear Regression

We use the **"Pokemon with stats"** dataset from Kaggle, curated by *Alberto Barradas* (https://www.kaggle.com/abcsds/pokemon).     

### Import the Dataset

The dataset is in CSV format; hence we use the `read_csv` function from Pandas.  
Immediately after importing, take a quick look at the data using the `head` function.

In [None]:
# Read the CSV Data
pkmndata = pd.read_csv('files/pokemonData.csv')
pkmndata.head()

Check the vital statistics of the dataset using the built-in `type` function and the `shape` attribute.     
Check the variables (and their types) in the dataset using the `info()` method.

In [None]:
print("Data type : ", type(pkmndata))
print("Data dims : ", pkmndata.shape)
print()
pkmndata.info()

### Relationship between Numeric Variables

Check the mutual relationship between the numeric variables using a Correlation matrix and Pairplots.   

In [None]:
# Extract only the numeric data variables
numDF = pd.DataFrame(pkmndata[["Total", "HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]])

# Correlation Matrix
print(numDF.corr())

# Heatmap of the Correlation Matrix
f, axes = plt.subplots(1, 1, figsize=(18, 12))
sb.heatmap(numDF.corr(), vmin = -1, vmax = 1, annot = True, fmt = ".2f", annot_kws = {"size": 18}, cmap = "RdBu")

In [None]:
# Draw pairs of variables against one another
sb.pairplot(data = numDF)
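
As a rough illustration of what each entry of `corr()` reports, the sketch below computes the Pearson correlation by hand and compares it with the Pandas result. The DataFrame `toy`, its columns `a` and `b`, and their values are made up for this example and are not part of the Pokemon data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not taken from the Pokemon dataset)
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 5.0],
                    "b": [2.0, 4.1, 5.9, 8.2, 9.8]})

# Pearson correlation: co-variation of the centered variables,
# divided by the product of their individual spreads
a_centered = toy["a"] - toy["a"].mean()
b_centered = toy["b"] - toy["b"].mean()
manual_corr = (a_centered * b_centered).sum() / np.sqrt(
    (a_centered ** 2).sum() * (b_centered ** 2).sum())

# The same entry as reported by Pandas (Pearson is the default method)
pandas_corr = toy.corr().loc["a", "b"]

print("Manual :", manual_corr)
print("Pandas :", pandas_corr)
```

A value close to 1 or -1 suggests a strong linear relationship, which is what makes a variable a promising Predictor for Linear Regression.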

### Uni-Variate Regression

We will start by setting up a Uni-Variate Linear Regression problem.   

> Regression Model : Response = $a$ $\times$ Predictor + $b$  

Check the mutual relationship between the variables to start with.

In [None]:
# Set up the problem with Predictor(s) and Response
predictor = "HP"
response = "Total"

# 2D scatterplot of two variables to observe their relationship
f = plt.figure(figsize=(16, 8))
sb.scatterplot(x = predictor, y = response, data = pkmndata)

Extract the Response and Predictor variables as two individual Pandas `DataFrame` objects.

In [None]:
# Extract Response and Predictors
y = pd.DataFrame(pkmndata[response])
X = pd.DataFrame(pkmndata[predictor])

Split the dataset randomly into Train and Test datasets using `train_test_split`.

In [None]:
# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
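
Because the split is random, every run of the cell above produces a different Train and Test set. The sketch below, on made-up data, shows how `test_size` controls the proportions and how the optional `random_state` parameter of `train_test_split` makes a split repeatable. The names `X_toy`, `y_toy` and the seed value 42 are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with 20 rows, where y is always 2 * x
X_toy = pd.DataFrame({"x": np.arange(20)})
y_toy = pd.DataFrame({"y": 2 * np.arange(20)})

# test_size = 0.25 sends a quarter of the rows to the Test set;
# random_state (an arbitrary seed) fixes the random shuffle
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42)

print("Train size :", Xa_train.shape, " Test size :", Xa_test.shape)
print("Same split both times :", Xa_test.index.equals(Xb_test.index))
```

The rows of `X` and `y` are shuffled together, so each Predictor value stays paired with its own Response value.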

`LinearRegression` is a class for the regression model in `sklearn`.     
We need to create an object of the `LinearRegression` class, as follows.     

In [None]:
# Create a Linear Regression object
linreg = LinearRegression()

Train the Linear Regression model using the Train Set `X_train` and `y_train`.   

In [None]:
# Train the Linear Regression model
linreg.fit(X_train, y_train)

You have *trained* the model to fit the following formula.

>  Regression Model : Response = $a$ $\times$ Predictor + $b$

Check Intercept ($b$) and Coefficient ($a$) of the regression line.

In [None]:
# Coefficients of the Linear Regression line
print('Intercept \t b = ', linreg.intercept_)
print('Coefficients \t a = ', linreg.coef_)
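
To see how `intercept_` and `coef_` relate to $a$ and $b$, it may help to fit the model on data where the answer is known in advance. The sketch below uses made-up, noise-free data generated from Response = 3 $\times$ Predictor + 2. The names `X_toy`, `y_toy` and `toy_reg` are assumptions for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data generated from Response = 3 * Predictor + 2
X_toy = pd.DataFrame({"Predictor": np.arange(10, dtype=float)})
y_toy = pd.DataFrame({"Response": 3.0 * X_toy["Predictor"] + 2.0})

toy_reg = LinearRegression()
toy_reg.fit(X_toy, y_toy)

# With a single-column DataFrame as the response,
# intercept_ is a 1D array and coef_ is a 2D array
b = toy_reg.intercept_[0]
a = toy_reg.coef_[0, 0]

print("Intercept   b =", b)
print("Coefficient a =", a)
```

On real data such as the Pokemon dataset the points do not lie exactly on a line, so the fitted values are the ones that minimize the squared error rather than an exact recovery.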

Predict the response variable using the model you just trained.

In [None]:
# Predict the Response on the Train Set
y_train_pred = linreg.predict(X_train)

# Plot the Linear Regression line
f = plt.figure(figsize=(16, 8))
plt.scatter(X_train, y_train)
plt.scatter(X_train, y_train_pred, color = "red")
plt.show()

Check the *Goodness of Fit* on the Train and Test Sets.    
Metrics : Explained Variance and Mean Squared Error.

In [None]:
# Explained Variance (R^2) on Train Set
print("Explained Variance (R^2) on Train Set \t", linreg.score(X_train, y_train))

# Mean Squared Error (MSE) on Train Set
y_train_pred = linreg.predict(X_train)
print("Mean Squared Error (MSE) on Train Set \t", mean_squared_error(y_train, y_train_pred))

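# Explained Variance (R^2) on Test Set
print("Explained Variance (R^2) on Test Set \t", linreg.score(X_test, y_test))

# Mean Squared Error (MSE) on Test Set
y_test_pred = linreg.predict(X_test)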
print("Mean Squared Error (MSE) on Test Set \t", mean_squared_error(y_test, y_test_pred))
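
The two metrics can also be computed by hand, which may make them easier to interpret. The sketch below uses made-up true and predicted values. It compares the manual results with `mean_squared_error` and with `r2_score` from `sklearn.metrics`, which is not imported elsewhere in this chapter but computes the same $R^2$ that `score` returns for a regression model.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true and predicted values of a response variable
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

# MSE : average of the squared prediction errors
mse_manual = np.mean((y_true - y_pred) ** 2)

# R^2 : one minus (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("MSE : manual =", mse_manual, " sklearn =", mean_squared_error(y_true, y_pred))
print("R^2 : manual =", r2_manual, " sklearn =", r2_score(y_true, y_pred))
```

MSE is expressed in squared units of the response, so lower is better. $R^2$ is unit-free, and values closer to 1 indicate a better fit.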

It is quite meaningful to check the Predictions against the True values of the Response variable.

In [None]:
# Predict the Response for both Train and Test
y_train_pred = linreg.predict(X_train)
y_test_pred = linreg.predict(X_test)

# Plot the Predictions vs the True values
f, axes = plt.subplots(1, 2, figsize=(16, 8))
axes[0].scatter(y_train, y_train_pred, color = "blue")
axes[0].plot(y_train, y_train, 'w-', linewidth = 1)
axes[0].set_xlabel("True values of the Response Variable (Train)")
axes[0].set_ylabel("Predicted values of the Response Variable (Train)")
axes[1].scatter(y_test, y_test_pred, color = "green")
axes[1].plot(y_test, y_test, 'w-', linewidth = 1)
axes[1].set_xlabel("True values of the Response Variable (Test)")
axes[1].set_ylabel("Predicted values of the Response Variable (Test)")
plt.show()

#### Quick Tasks

- Write a generic function in Python to perform Uni-Variate Linear Regression on an input dataset and any Response-Predictor pair.

### Multi-Variate Linear Regression

Let us set up a Multi-Variate Linear Regression problem.   

> Regression Model : Response = $a_1$ $\times$ Predictor$_1$ + $a_2$ $\times$ Predictor$_2$ + $\cdots$ + $a_k$ $\times$ Predictor$_k$ + $b$      

Fortunately, our standard Linear Regression code works unchanged when `X` has more than one column.   

In [None]:
# Specify the Predictors and Response
response = "Total"
predictors = ["HP", "Attack", "Defense", "Speed"]

# Extract Response and Predictors
y = pd.DataFrame(pkmndata[response])
X = pd.DataFrame(pkmndata[predictors])

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

# Train a Linear Regression Model
linreg = LinearRegression()         # create the linear regression object
linreg.fit(X_train, y_train)        # train the linear regression model

# Predict Response corresponding to Predictors
y_train_pred = linreg.predict(X_train)
y_test_pred = linreg.predict(X_test)

# Plot the Predictions vs the True values
f, axes = plt.subplots(1, 2, figsize=(16, 8))
axes[0].scatter(y_train, y_train_pred, color = "blue")
axes[0].plot(y_train, y_train, 'w-', linewidth = 1)
axes[0].set_xlabel("True values of the Response Variable (Train)")
axes[0].set_ylabel("Predicted values of the Response Variable (Train)")
axes[1].scatter(y_test, y_test_pred, color = "green")
axes[1].plot(y_test, y_test, 'w-', linewidth = 1)
axes[1].set_xlabel("True values of the Response Variable (Test)")
axes[1].set_ylabel("Predicted values of the Response Variable (Test)")
plt.show()

# Check the Goodness of Fit (on Train Data)
print("Goodness of Fit of Model \tTrain Dataset")
print("Explained Variance (R^2) \t", linreg.score(X_train, y_train))
print("Mean Squared Error (MSE) \t", mean_squared_error(y_train, y_train_pred))
print()

# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Mean Squared Error (MSE) \t", mean_squared_error(y_test, y_test_pred))
print()
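
In the multi-variate case, `coef_` holds one coefficient $a_i$ per predictor, in the same order as the columns of `X`. The sketch below illustrates this on made-up data generated from Response = 2 $\times$ p1 $-$ 1 $\times$ p2 + 5. The column names `p1`, `p2` and the random seed are assumptions for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data : Response = 2 * p1 - 1 * p2 + 5
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"p1": rng.normal(size=50), "p2": rng.normal(size=50)})
y_toy = pd.DataFrame({"Response": 2.0 * X_toy["p1"] - 1.0 * X_toy["p2"] + 5.0})

toy_reg = LinearRegression()
toy_reg.fit(X_toy, y_toy)

# One coefficient per predictor, in the column order of X_toy
for name, coef in zip(X_toy.columns, toy_reg.coef_[0]):
    print("Coefficient of", name, ":", coef)
print("Intercept :", toy_reg.intercept_[0])
```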

#### Quick Tasks

- Write a generic function in Python to perform Multi-Variate Linear Regression on an input dataset and any Response-Predictor(s) set.

---

## Case Study : Classification

We use the **"Pokemon with stats"** dataset from Kaggle, curated by *Alberto Barradas* (https://www.kaggle.com/abcsds/pokemon).     

### Import the Dataset

The dataset is in CSV format; hence we use the `read_csv` function from Pandas.  
Immediately after importing, take a quick look at the data using the `head` function.

In [None]:
# Read the CSV Data
pkmndata = pd.read_csv('files/pokemonData.csv')
pkmndata.head()

Check the vital statistics of the dataset using the built-in `type` function and the `shape` attribute.     
Check the variables (and their types) in the dataset using the `info()` method.

In [None]:
print("Data type : ", type(pkmndata))
print("Data dims : ", pkmndata.shape)
print()
pkmndata.info()

### Uni-Variate Classification

We will start by setting up a Uni-Variate Classification problem.   

> Classification Model : Response vs Predictor 

Check the mutual relationship between the variables to start with.

In [None]:
# Set up the problem with Predictor(s) and Response
predictor = "HP"
response = "Legendary"

# Convert the response variable to Category
pkmndata[response] = pkmndata[response].astype("category")

# Boxplot of numeric variable against categorical variable
f = plt.figure(figsize=(16, 4))
sb.boxplot(x = predictor, y = response, data = pkmndata)

Extract the Response and Predictor variables as two individual Pandas `DataFrame` objects.

In [None]:
# Extract Response and Predictors
y = pd.DataFrame(pkmndata[response])
X = pd.DataFrame(pkmndata[predictor])

# Convert the response to Binary
y[response] = y[response].astype("bool")

Split the dataset randomly into Train and Test datasets using `train_test_split`.

In [None]:
# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

`DecisionTreeClassifier` is a class for the classification model in `sklearn`.     
We need to create an object of the `DecisionTreeClassifier` class, as follows.

In [None]:
# Create a Decision Tree Classifier object
dectree = DecisionTreeClassifier(max_depth = 2)

Train the Classification Tree model using the Train Set `X_train` and `y_train`.   

In [None]:
# Train the Decision Tree model
dectree.fit(X_train, y_train)

In [None]:
# Plot the Decision Tree model
f, axes = plt.subplots(1, 1, figsize=(16, 12))

plot_tree(dectree, 
          filled=True, 
          feature_names=list(X_train.columns),  # plain list; some scikit-learn versions reject a Pandas Index
          class_names=["False","True"],
          rounded = True)
plt.show()
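
With a single numeric predictor, each internal node of the tree is simply a threshold on that predictor. The sketch below fits a depth-1 tree on made-up data where the class flips between the values 4 and 5, and reads the learned threshold from the fitted tree's `tree_.threshold` array. The toy values and the names `X_toy`, `y_toy` and `stump` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data : the response is True exactly when HP >= 5
X_toy = pd.DataFrame({"HP": np.arange(10)})
y_toy = X_toy["HP"] >= 5

# A tree of depth 1 can make only one split
stump = DecisionTreeClassifier(max_depth=1)
stump.fit(X_toy, y_toy)

# tree_.threshold[0] holds the split value at the root node
root_threshold = stump.tree_.threshold[0]
print("Root split : HP <=", root_threshold)

# Points on either side of the threshold get different classes
preds = stump.predict(pd.DataFrame({"HP": [4, 5]}))
print("Predictions for HP = 4 and HP = 5 :", preds)
```

The threshold lands midway between the two neighbouring values where the class changes, which is the same kind of rule shown in the boxes of the `plot_tree` diagram.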

Check the *Goodness of Fit* on the Train and Test Sets.    
Metrics : Classification Accuracy and Confusion Matrix.

In [None]:
# Predict Response corresponding to Predictors
y_train_pred = dectree.predict(X_train)
y_test_pred = dectree.predict(X_test)

# Check the Goodness of Fit (on Train Data)
print("Goodness of Fit of Model \tTrain Dataset")
print("Classification Accuracy \t:", dectree.score(X_train, y_train))
print()

# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Classification Accuracy \t:", dectree.score(X_test, y_test))
print()

# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(y_train, y_train_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
sb.heatmap(confusion_matrix(y_test, y_test_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])
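
To read the heatmaps, note that `confusion_matrix` places the true classes along the rows and the predicted classes along the columns, with the classes in sorted order (`False` before `True`). The sketch below, on made-up labels, shows the four cells and how Classification Accuracy follows from them.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels
y_true = [False, False, True, True, True]
y_pred = [False, True, True, True, False]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)

# For a binary problem, the cells unpack row by row as TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()

# Accuracy : correct predictions (the diagonal) over all predictions
accuracy = (tn + tp) / cm.sum()
print("Accuracy :", accuracy)
```

When one class is rare, as `Legendary` is likely to be, the off-diagonal cells are worth inspecting alongside the accuracy, since a model can score well while missing most of the rare class.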

#### Quick Tasks

- Write a generic function in Python to perform Uni-Variate Classification on an input dataset and any Response-Predictor pair.

### Multi-Variate Classification

Let us set up a Multi-Variate Classification problem.   

> Classification Model : Response vs {Predictor$_1$, Predictor$_2$, $\ldots$, Predictor$_k$}      

Fortunately, our standard Classification code works unchanged when `X` has more than one column.   

In [None]:
# Specify the Predictors and Response
response = "Legendary"
predictors = ["HP", "Attack", "Defense", "Speed"]

# Extract Response and Predictors
y = pd.DataFrame(pkmndata[response])
X = pd.DataFrame(pkmndata[predictors])

# Convert the response to Binary
y[response] = y[response].astype("bool")

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

# Decision Tree using Train Data
dectree = DecisionTreeClassifier(max_depth = 2)  # create the decision tree object
dectree.fit(X_train, y_train)                    # train the decision tree model

# Predict Response corresponding to Predictors
y_train_pred = dectree.predict(X_train)
y_test_pred = dectree.predict(X_test)

# Check the Goodness of Fit (on Train Data)
print("Goodness of Fit of Model \tTrain Dataset")
print("Classification Accuracy \t:", dectree.score(X_train, y_train))
print()

# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Classification Accuracy \t:", dectree.score(X_test, y_test))
print()

# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(y_train, y_train_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
sb.heatmap(confusion_matrix(y_test, y_test_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])

# Plot the Decision Tree
f, axes = plt.subplots(1, 1, figsize=(16, 12))
plot_tree(dectree, 
          filled=True, 
          feature_names=list(X_train.columns),  # plain list; some scikit-learn versions reject a Pandas Index
          class_names=["False","True"],
          rounded = True)
plt.show()

#### Quick Tasks

- Write a generic function in Python to perform Multi-Variate Classification on an input dataset and any Response-Predictor(s) set.

---

## Case Study : Clustering

We use the **"Iris"** dataset from within `scikit-learn`.     
Ref : https://scikit-learn.org/stable/datasets/toy_dataset.html#iris-dataset

### Import the Dataset

The dataset is in a special internal format; hence we will extract the `data` portion.  
Immediately after importing, take a quick look at the data using the `head` function.

In [None]:
# Load the internal dataset
from sklearn.datasets import load_iris
iris = load_iris(as_frame = True)

# Take only the attributes
irisData = pd.DataFrame(iris['data'])
irisData.columns = ["sepalLength", "sepalWidth", "petalLength", "petalWidth"]
irisData.head()

Check the vital statistics of the dataset using the built-in `type` function and the `shape` attribute.     
Check the variables (and their types) in the dataset using the `info()` method.

In [None]:
print("Data type : ", type(irisData))
print("Data dims : ", irisData.shape)
print()
irisData.info()

### Bi-Variate Clustering : Sepal-related Features

We will start by setting up a Bi-Variate Clustering problem, using only Sepal features.      
Extract the required variables from the dataset, and then perform Bi-Variate Clustering.     

In [None]:
# Extract the Features from the Data
sepalData = pd.DataFrame(irisData[['sepalLength','sepalWidth']])
                           
# Plot the Raw Data on a 2D grid
f = plt.figure(figsize=(16,8))
plt.scatter(x = "sepalLength", y = "sepalWidth", data = sepalData)                    

**Basic K-Means Clustering**

Guess the number of clusters from the 2D plot, and perform KMeans Clustering.    
We will use the `KMeans` clustering model from `sklearn.cluster` module.

In [None]:
# Import KMeans from sklearn.cluster
from sklearn.cluster import KMeans

# Guess the Number of Clusters
num_clust = 2

# Create Clustering Model using KMeans
kmeans = KMeans(n_clusters = num_clust)

# Fit the Clustering Model on the Data
kmeans.fit(sepalData)

We may use the model on the data to `predict` the clusters.

In [None]:
# Predict the Cluster Labels
labels = kmeans.predict(sepalData)

# Append Labels to the Data
sepalData_labeled = sepalData.copy()
sepalData_labeled["Cluster"] = pd.Categorical(labels)
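
A fitted `KMeans` model stores one center per cluster in `cluster_centers_`, and `predict` assigns each point to its nearest center. The sketch below checks this on a made-up dataset with two well-separated groups. The toy values, the names `toy` and `toy_kmeans`, and the explicit `n_init` and `random_state` settings are assumptions added so that the example is repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical data : two tight groups, around (1, 1) and (5, 5)
toy = pd.DataFrame({"x": [1.0, 1.2, 0.8, 5.0, 5.2, 4.8],
                    "y": [1.0, 0.9, 1.1, 5.0, 5.1, 4.9]})

# n_init and random_state are set only to make the run repeatable
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
toy_kmeans.fit(toy)

toy_labels = toy_kmeans.predict(toy)
centers = toy_kmeans.cluster_centers_
print("Cluster centers :")
print(centers)

# Distance of every point to every center; the nearest center gives the label
distances = np.linalg.norm(toy.to_numpy()[:, None, :] - centers[None, :, :], axis=2)
nearest = distances.argmin(axis=1)
print("Labels match nearest centers :", (nearest == toy_labels).all())
```

The label numbers themselves (0 or 1) are arbitrary; only the grouping of the points is meaningful.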

In [None]:
# Visualize the Clusters in the Data
f = plt.figure(figsize=(16,8))
plt.scatter(x = "sepalLength", y = "sepalWidth", c = "Cluster", cmap = 'viridis', data = sepalData_labeled)

### Bi-Variate Clustering : Petal-related Features

We will now set up a second Bi-Variate Clustering problem, using only Petal features.      
Extract the required variables from the dataset, and then perform Bi-Variate Clustering.       

In [None]:
# Extract the Features from the Data
petalData = pd.DataFrame(irisData[['petalLength','petalWidth']])
                           
# Plot the Raw Data on a 2D grid
f = plt.figure(figsize=(16,8))
plt.scatter(x = "petalLength", y = "petalWidth", data = petalData)                    

**Basic K-Means Clustering**

Guess the number of clusters from the 2D plot, and perform KMeans Clustering.    
We will use the `KMeans` clustering model from `sklearn.cluster` module.

In [None]:
# Import KMeans from sklearn.cluster
from sklearn.cluster import KMeans

# Guess the Number of Clusters
num_clust = 2

# Create Clustering Model using KMeans
kmeans = KMeans(n_clusters = num_clust)

# Fit the Clustering Model on the Data
kmeans.fit(petalData)

We may use the model on the data to `predict` the clusters.

In [None]:
# Predict the Cluster Labels
labels = kmeans.predict(petalData)

# Append Labels to the Data
petalData_labeled = petalData.copy()
petalData_labeled["Cluster"] = pd.Categorical(labels)

In [None]:
# Visualize the Clusters in the Data
f = plt.figure(figsize=(16,8))
plt.scatter(x = "petalLength", y = "petalWidth", c = "Cluster", cmap = 'viridis', data = petalData_labeled)

#### Quick Tasks

- Try out Bi-Variate Clustering on the same dataset (Sepal and Petal features) using a Gaussian Mixture Model (`GaussianMixture` from `sklearn.mixture`).