
# Data Modeling

The goal of this workshop is to learn how to build models that predict a target variable in a sample dataset, and how to evaluate how well those models perform.

## Content

1. Preparing Data
   - Standardization
   - Train-Test Split
2. Binary Classification
   - Logistic Regression
3. Metrics to Assess Performance
   - Confusion Matrix
   - Precision & Recall
   - F-1 Score
4. More Binary Classification Methods
   - Support Vector Machine
   - Decision Tree
   - Random Forest

# Section 0: Import Libraries

### **Q0.1** Import pandas, numpy, and scikit-learn

In [None]:
import numpy as np
import pandas as pd
import sklearn

### **Q0.2** Import "stroke_prediction_2.csv"

In [None]:
df = pd.read_csv('stroke_prediction_2.csv')
df.head()

In [None]:
# 'stroke' is a 0/1 column, so its mean is the fraction of positive cases.
print(f"Stroke Incidence: {df['stroke'].mean() * 100:.2f}%")

### **Q0.3** Prepare "stroke_prediction_2.csv" via one-hot encoding

In [None]:
df = pd.get_dummies(df, columns=['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status'], drop_first=True)
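
To see what `drop_first=True` does, here is a minimal sketch on a hypothetical three-row frame. The values are illustrative assumptions, not rows from the workshop CSV.

```python
import pandas as pd

# Hypothetical toy frame; the values are illustrative, not taken from the workshop CSV.
toy = pd.DataFrame({
    "age": [34, 61, 47],
    "smoking_status": ["never smoked", "smokes", "formerly smoked"],
})

# drop_first=True drops the first category in sorted order ("formerly smoked"),
# so k categories become k - 1 indicator columns.
encoded = pd.get_dummies(toy, columns=["smoking_status"], drop_first=True)
print(encoded.columns.tolist())
```

A row with every indicator column set to false then represents the dropped category, which avoids a redundant, perfectly collinear column.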

### **Q0.4** Split dataset into X and y

In [None]:
target = ['stroke']
X = df.drop(target+['id'], axis=1)
y = df[target].values.ravel()

# Section 1: Preprocess and Standardize Data
Using `sklearn`

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [None]:
print(f"X: {X.shape}   |    y: {y.shape}")

### **Q1.1** Train-test split the data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
print(f"Test:  X: {X_test.shape}    |    y: {y_test.shape}")
print(f"Train: X: {X_train.shape}   |    y: {y_train.shape}")
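
When the positive class is rare, a purely random split can leave the test set with a different class ratio than the training set. One option, sketched below on hypothetical labels (10 positives out of 100, an assumed ratio), is the `stratify` argument of `train_test_split`. The workshop cell above does not use it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 10 positives out of 100 (assumed for illustration).
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 10 + [0] * 90)

# stratify=y_toy asks for the same class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

print(f"Train positives: {y_tr.mean():.2%}  |  Test positives: {y_te.mean():.2%}")
```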

### **Q1.2** Standardize the training data, and apply the same fit to the test data

In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
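
The sketch below illustrates why the scaler is fit on the training split only: the test split is transformed with the training mean and standard deviation, so no information from the test rows leaks into preprocessing. The arrays are hypothetical random data, and the `toy_` names are assumptions used to avoid clashing with the notebook's variables.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical random data standing in for X_train / X_test.
rng = np.random.default_rng(0)
toy_train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
toy_test = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

toy_scaler = StandardScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train)  # learns mean/std from training rows
toy_test_scaled = toy_scaler.transform(toy_test)        # reuses the training mean/std

# The same transform written out by hand.
manual_test_scaled = (toy_test - toy_train.mean(axis=0)) / toy_train.std(axis=0)

print(toy_train_scaled.mean(axis=0).round(6))  # each column is centered on the training data
print(np.allclose(toy_test_scaled, manual_test_scaled))
```

The scaled test columns will usually have means close to, but not exactly, zero, which is expected.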

# Section 2: Implement Logistic Regression Binary Classification

### **Q2.1** Build LR Model and generate a prediction based on test data



In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

### **Q2.2** Determine Accuracy of LR model

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
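
Accuracy alone can be misleading when one class is rare. As a rough illustration, assuming a hypothetical dataset with 5 positives in 100 samples, a "model" that always predicts the majority class still scores highly:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: 5 positives out of 100 (an assumed imbalance for illustration).
y_true_toy = np.array([1] * 5 + [0] * 95)

# A "model" that always predicts the majority class (0).
y_pred_majority = np.zeros_like(y_true_toy)

baseline_accuracy = accuracy_score(y_true_toy, y_pred_majority)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

Comparing a model's accuracy against this kind of baseline is one way to judge whether the model has learned anything about the minority class.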

### **Q2.3** Assemble confusion matrix for LR model performance

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f' TP: {tp}    FP: {fp}\n FN: {fn}    TN: {tn}')
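
For binary labels 0 and 1, `confusion_matrix` puts true labels on the rows and predicted labels on the columns, so `.ravel()` yields the counts in the order TN, FP, FN, TP. A small check on hypothetical labels:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels (1 = positive class); not taken from the workshop data.
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_toy = [0, 1, 0, 0, 1, 0, 1, 0]

cm = confusion_matrix(y_true_toy, y_pred_toy)
print(cm)
# Row 0: actual negatives -> [TN, FP]
# Row 1: actual positives -> [FN, TP]

tn, fp, fn, tp = cm.ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```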

### **Q2.4** Compute the Precision, Recall, and F1 Score of the LR model

In [None]:
# type your answer here
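
One possible approach, shown here on hypothetical toy labels rather than the workshop's `y_test` and `y_pred`, is to use the scoring functions in `sklearn.metrics` and cross-check them against the confusion-matrix counts. Precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is the harmonic mean of the two.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical toy labels (1 = positive class); not taken from the workshop data.
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_toy = [0, 1, 1, 0, 1, 0, 1, 0]

precision = precision_score(y_true_toy, y_pred_toy)  # TP / (TP + FP)
recall = recall_score(y_true_toy, y_pred_toy)        # TP / (TP + FN)
f1 = f1_score(y_true_toy, y_pred_toy)                # 2 * P * R / (P + R)

# Cross-check against the raw confusion-matrix counts.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()
manual_precision = tp / (tp + fp)
manual_recall = tp / (tp + fn)

print(f"Precision: {precision:.3f}  Recall: {recall:.3f}  F1: {f1:.3f}")
```

The same three calls should work on the notebook's own labels and predictions once a model has been fit.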

### **Q2.5** Obtain the weights (coefficients) of the LR model

In [None]:
# Pair each feature name with its learned weight, largest positive weights first.
lr_coef = pd.DataFrame({"feature": X.columns, "weight": model.coef_[0]})
lr_coef.sort_values(by="weight", ascending=False)

# Section 3: Implement a Support Vector Machine

### **Q3.1** Build SVM Model (SVC) and generate a prediction based on test data

In [None]:
from sklearn.svm import SVC

In [None]:
# type your answer here

In [None]:
# apply the same analysis as before
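
As a minimal sketch of the workflow, the block below fits an `SVC` on synthetic data from `make_classification` instead of the stroke dataset. All `toy` names and the dataset size are assumptions. The inputs are standardized first because SVMs are sensitive to feature scale.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumed sizes, not the workshop dataset).
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# Fit the scaler on the training split only, then reuse it for the test split.
toy_scaler = StandardScaler()
X_tr_s = toy_scaler.fit_transform(X_tr)
X_te_s = toy_scaler.transform(X_te)

svm_toy = SVC()  # default settings use an RBF kernel
svm_toy.fit(X_tr_s, y_tr)
y_pred_svm = svm_toy.predict(X_te_s)

svm_accuracy = accuracy_score(y_te, y_pred_svm)
print(f"Toy SVM accuracy: {svm_accuracy:.3f}")
```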

# Section 4: Implement a Decision Tree Model

### **Q4.1** Build a decision tree model and generate a prediction based on test data

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
# type your answer here

In [None]:
# apply the same analysis as before
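
The following sketch fits a `DecisionTreeClassifier` on synthetic data, not the stroke dataset. Trees split on per-feature thresholds, so the toy data is left unscaled. The `max_depth=3` limit and the `toy` names are illustrative assumptions, chosen to keep the tree small and readable.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumed sizes, not the workshop dataset).
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# max_depth caps how many successive splits any path may contain.
dt_toy = DecisionTreeClassifier(max_depth=3, random_state=0)
dt_toy.fit(X_tr, y_tr)
y_pred_dt = dt_toy.predict(X_te)

print(f"Tree depth: {dt_toy.get_depth()}  |  leaves: {dt_toy.get_n_leaves()}")
print(confusion_matrix(y_te, y_pred_dt))
```

Leaving `max_depth` unset lets the tree grow until its leaves are pure, which tends to fit the training data very closely.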

### **Q4.2** Visualize the decision tree

In [None]:
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

# dt_model is assumed to be the fitted DecisionTreeClassifier from Q4.1.
plt.figure(figsize=(20, 10))
plot_tree(dt_model, filled=True, feature_names=list(X.columns), class_names=['No Stroke', 'Stroke'])
plt.title("Decision Tree for Stroke Prediction")
plt.show()

# Section 5: Implement a Random Forest Classifier

### **Q5.1** Build a random forest model and generate a prediction based on test data

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
# type your answer here

In [None]:
# apply the same analysis as before
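
A random forest averages many decision trees, each trained on a bootstrap sample of the rows. The sketch below uses synthetic data and hypothetical `toy` names. `n_estimators=100` matches the library default and is written out only for clarity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumed sizes, not the workshop dataset).
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

rf_toy = RandomForestClassifier(n_estimators=100, random_state=0)
rf_toy.fit(X_tr, y_tr)
y_pred_rf = rf_toy.predict(X_te)

rf_accuracy = accuracy_score(y_te, y_pred_rf)
print(f"Toy random forest accuracy: {rf_accuracy:.3f}")

# Impurity-based importances, one value per input feature.
importances = rf_toy.feature_importances_
print(np.round(importances, 3))
```

The importances give a rough ranking of which features the trees split on most. Unlike the logistic regression weights, they carry no sign.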

**That wraps up DEEP Lecture #3!**

See you next week, when you'll apply the techniques from today's workshop to your team's dataset!