# TASK #1: UNDERSTAND THE PROBLEM STATEMENT AND BUSINESS CASE

![image-2.png](attachment:image-2.png)

![image.png](attachment:image.png)

# TASK #2: IMPORT LIBRARIES AND DATASETS

In [None]:
# import the necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from jupyterthemes import jtplot 
jtplot.style(theme = 'monokai', context = 'notebook', ticks = True, grid = False)

In [None]:
# read the csv file 
df = pd.read_csv("cardio_train.csv", sep=";")

In [None]:
df.head()

**PRACTICE OPPORTUNITY #1 [OPTIONAL]:**
- **Display the last 5, 8, and 10 rows in the df DataFrame**

# TASK #3: PERFORM EXPLORATORY DATA ANALYSIS

In [None]:
# Drop id
df = df.drop(columns = 'id')

In [None]:
# since the age is given in days, we convert it into years
df['age'] = df['age']/365

In [None]:
df.head()

In [None]:
# Statistical summary of the dataframe
df.describe()

In [None]:
df.hist(bins = 30, figsize = (20,20), color = 'r')
plt.show()

In [None]:
# get the correlation matrix
corr_matrix = df.corr()
corr_matrix
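By default, `df.corr()` computes the Pearson correlation coefficient for every pair of columns. As a rough illustration of what a single entry in that matrix means, here is a minimal pure-Python sketch. The helper `pearson` and the two toy columns are made up for this example and are not taken from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the mean (unnormalised covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy columns: the values are invented for illustration only
col_a = [60, 72, 80, 95, 110]
col_b = [110, 120, 125, 140, 150]

r = pearson(col_a, col_b)
print("Correlation between the toy columns: {:.3f}".format(r))
```

A value close to +1 or -1 indicates a strong linear relationship, while a value near 0 indicates little or no linear relationship.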

In [None]:
# plotting the correlation matrix
plt.figure(figsize = (16,16))
sns.heatmap(corr_matrix, annot = True)
plt.show()

# TASK #4: CREATE TRAINING AND TESTING DATASET

In [None]:
# split the dataframe into target and features
y = df['cardio']
X = df.drop(columns =['cardio'])

In [None]:
X.shape

In [None]:
y.shape

In [None]:
# splitting the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

In [None]:
X_train.shape

In [None]:
y_train.shape

In [None]:
X_test.shape

In [None]:
y_test.shape
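The split above is random, so the rows that land in each set change on every run unless a seed is fixed (`train_test_split` accepts a `random_state` argument for this). The sketch below shows the idea of a seeded 80/20 shuffle split in plain Python. The helper `simple_train_test_split` is hypothetical and only approximates what the library does.

```python
import random

def simple_train_test_split(rows, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed and hold out a test_size fraction."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)      # seeded, so the shuffle is repeatable
    n_test = int(round(len(rows) * test_size))
    test_idx = indices[:n_test]               # first n_test shuffled indices
    train_idx = indices[n_test:]              # everything else
    train = [rows[i] for i in train_idx]
    test = [rows[i] for i in test_idx]
    return train, test

# Toy "dataset": 100 row identifiers stand in for the real rows
rows = list(range(100))
train_a, test_a = simple_train_test_split(rows)
train_b, test_b = simple_train_test_split(rows)

print(len(train_a), len(test_a))
print("Same split on both runs:", train_a == train_b and test_a == test_b)
```

Fixing the seed makes accuracy numbers comparable between runs, which helps when comparing different hyperparameter settings.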

# TASK #5: UNDERSTAND XG-BOOST ALGORITHM TO SOLVE CLASSIFICATION TYPE PROBLEMS

![image.png](attachment:image.png)

![image.png](attachment:image.png)

# TASK #6: TRAIN AN XG-BOOST CLASSIFIER IN SK-LEARN

In [None]:
!pip install xgboost

In [None]:
from xgboost import XGBClassifier

In [None]:
# Train an XGBoost classifier model 

xgb_classifier = XGBClassifier(objective ='binary:logistic', eval_metric = 'error', learning_rate = 0.1, max_depth = 1, n_estimators = 10)
xgb_classifier.fit(X_train, y_train)
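To get a feel for what `max_depth = 1`, `n_estimators = 10` and `learning_rate = 0.1` mean, the sketch below boosts ten one-split trees (stumps) on a single toy feature using the gradient and Hessian of the logistic loss. It is a heavily simplified illustration, not XGBoost's actual implementation: it handles one feature only, ignores the `gamma` penalty, and assumes the feature has at least two distinct values. All function names and the toy data are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(x, grad, hess, lam=1.0):
    """Pick the threshold with the largest gain; return (threshold, left weight, right weight)."""
    def score(g, h):
        return g * g / (h + lam)
    G, H = sum(grad), sum(hess)
    best = None
    for t in sorted(set(x))[:-1]:             # candidate split points
        gl = sum(g for xi, g in zip(x, grad) if xi <= t)
        hl = sum(h for xi, h in zip(x, hess) if xi <= t)
        gain = score(gl, hl) + score(G - gl, H - hl) - score(G, H)
        if best is None or gain > best[0]:
            # Leaf weight is -G / (H + lambda) on each side of the split
            best = (gain, t, -gl / (hl + lam), -(G - gl) / (H - hl + lam))
    return best[1:]

def boost(x, y, n_estimators=10, learning_rate=0.1):
    """Fit n_estimators stumps one after another, each correcting the current margin."""
    margin = [0.0] * len(x)                   # raw score before the sigmoid
    stumps = []
    for _ in range(n_estimators):
        p = [sigmoid(m) for m in margin]
        grad = [pi - yi for pi, yi in zip(p, y)]      # first derivative of log loss
        hess = [pi * (1.0 - pi) for pi in p]          # second derivative of log loss
        t, wl, wr = fit_stump(x, grad, hess)
        stumps.append((t, wl, wr))
        margin = [m + learning_rate * (wl if xi <= t else wr)
                  for m, xi in zip(margin, x)]
    return stumps

def predict_proba(stumps, xi, learning_rate=0.1):
    """Sum the shrunken stump outputs and squash them into a probability."""
    margin = sum(learning_rate * (wl if xi <= t else wr) for t, wl, wr in stumps)
    return sigmoid(margin)

# Toy data: one invented feature and a binary label
x = [100, 110, 120, 125, 140, 150, 160, 170]
y = [0, 0, 0, 0, 1, 1, 1, 1]

stumps = boost(x, y)
probs = [predict_proba(stumps, xi) for xi in x]
print([round(p, 3) for p in probs])
```

Each stump nudges the predictions only slightly because its output is multiplied by the learning rate, which is why a small `n_estimators` combined with a small `learning_rate` tends to give a cautious model.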

# TASK #7: TEST XGBOOST CLASSIFIER TO PERFORM INFERENCE

In [None]:
# compute the accuracy of the trained model on the testing dataset
result = xgb_classifier.score(X_test, y_test)
print("Accuracy : {}".format(result))

In [None]:
# make predictions on the test data
y_predict = xgb_classifier.predict(X_test)
y_predict

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_predict))
# precision = TP / (TP + FP)
# recall = TP / (TP + FN)
# The F-beta score is a weighted harmonic mean of precision and recall;
# it reaches its best value at 1 and its worst at 0. The report shows F1 (beta = 1).
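The formulas in the comments above can be checked by hand on a small example. The following sketch computes precision, recall and the F-beta score for the positive class from two toy label lists. The helper `precision_recall_fbeta` and the labels are made up, and the helper assumes at least one predicted positive and one actual positive so that no division by zero occurs.

```python
def precision_recall_fbeta(y_true, y_pred, beta=1.0):
    """Precision, recall and F-beta for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    # Weighted harmonic mean: beta > 1 favours recall, beta < 1 favours precision
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, fbeta

# Toy labels, invented for illustration: 3 TP, 2 FP, 1 FN, 4 TN
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_fbeta(toy_true, toy_pred)
print("precision = {:.3f}, recall = {:.3f}, F1 = {:.3f}".format(precision, recall, f1))
```

With `beta = 1` the result is the F1 score, which is the value listed in the `f1-score` column of the classification report.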


In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_predict)
sns.heatmap(cm, fmt = 'd', annot = True)
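When reading the heatmap, note that scikit-learn's `confusion_matrix` puts the true labels on the rows and the predicted labels on the columns. The sketch below rebuilds that layout in plain Python for a binary problem. The helper `confusion_counts` and the toy labels are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """2x2 matrix with true labels on the rows and predicted labels on the columns."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Toy labels, invented for illustration
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

cm_toy = confusion_counts(toy_true, toy_pred)
(tn, fp), (fn, tp) = cm_toy               # row 0: actual 0, row 1: actual 1
accuracy = (tn + tp) / len(toy_true)

print(cm_toy)
print("TN = {}, FP = {}, FN = {}, TP = {}".format(tn, fp, fn, tp))
print("Accuracy = {}".format(accuracy))
```

The diagonal holds the correct predictions, so the accuracy is the sum of the diagonal divided by the total number of samples.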

**PRACTICE OPPORTUNITY #2 [OPTIONAL]:**
- **Try a larger max_depth and retrain the model**
- **Assess the performance of the trained model**
- **What do you conclude?**

# FINAL CAPSTONE PROJECT

Using the "diabetes.csv" dataset, perform the following:
1. Load the "diabetes.csv" dataset using Pandas
2. Split the data into 80% for training and 20% for testing
3. Train an XG-Boost classifier model using the SK-Learn library
4. Assess the trained model's performance
5. Plot the confusion matrix
6. Print the classification report

# PRACTICE OPPORTUNITIES SOLUTIONS

**PRACTICE OPPORTUNITY #1 SOLUTION:**
- **Display the last 5, 8, and 10 rows in the df DataFrame**

In [None]:
df.tail()

In [None]:
df.tail(8)

In [None]:
df.tail(10)

**PRACTICE OPPORTUNITY #2 SOLUTION:**
- **Try a much larger max_depth and retrain the model**
- **Assess the performance of the trained model**
- **What do you conclude?**

In [None]:
# Train an XGBoost classifier model 

xgb_classifier = XGBClassifier(objective ='binary:logistic', eval_metric = 'error', learning_rate = 0.1, max_depth = 10, n_estimators = 10, use_label_encoder=False)
xgb_classifier.fit(X_train, y_train)


# make predictions on the test data
y_predict = xgb_classifier.predict(X_test)

from sklearn.metrics import classification_report
print(classification_report(y_test, y_predict))

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_predict)
sns.heatmap(cm, fmt = 'd', annot = True)

# FINAL CAPSTONE PROJECT SOLUTION

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Provide the full path to the CSV file if it is not in the working directory
df = pd.read_csv('diabetes.csv')

In [None]:
df.info()

In [None]:
# Plot Histogram
df.hist(bins = 30, figsize = (20,20), color = 'b');

In [None]:
# Plot the correlation matrix
correlations = df.corr()
f, ax = plt.subplots(figsize = (20, 20))
sns.heatmap(correlations, annot = True);

In [None]:
y = df['Outcome']
y

In [None]:
X = df.drop(['Outcome'], axis = 1)
X

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

In [None]:
X_train.shape

In [None]:
X_test.shape

In [None]:
# Train an XGBoost classifier model
# (import repeated here so the capstone solution runs on its own)
from xgboost import XGBClassifier

xgb_classifier = XGBClassifier(objective ='binary:logistic', eval_metric = 'error', learning_rate = 0.1, max_depth = 1, n_estimators = 10, use_label_encoder=False)
xgb_classifier.fit(X_train, y_train)

In [None]:
# compute the accuracy of the trained model on the testing dataset
result = xgb_classifier.score(X_test, y_test)
print("Accuracy : {}".format(result))

In [None]:
# make predictions on the test data
y_predict = xgb_classifier.predict(X_test)
y_predict

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_predict))
# precision = TP / (TP + FP)
# recall = TP / (TP + FN)
# The F-beta score is a weighted harmonic mean of precision and recall;
# it reaches its best value at 1 and its worst at 0. The report shows F1 (beta = 1).


In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_predict)
sns.heatmap(cm, fmt = 'd', annot = True)

# EXCELLENT JOB!