# Step 1 - Download the Dataset
Download the dataset from the following link:
https://www.kaggle.com/joniarroba/noshowappointments

# Step 2 - Reading the Dataset
Read the dataset into a pandas DataFrame.
Does the dataset include any missing values? If so, drop them.
Hint: pandas can do that with one line of code.

In [185]:
import pandas as pd
df = pd.read_csv("KaggleV2-May-2016.csv").dropna()
print("df shape: ", df.shape)
df.head(5)

df shape:  (110527, 14)


Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
0,29872500000000.0,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0,1,0,0,0,0,No
1,558997800000000.0,5642503,M,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,0,0,0,0,0,No
2,4262962000000.0,5642549,F,2016-04-29T16:19:04Z,2016-04-29T00:00:00Z,62,MATA DA PRAIA,0,0,0,0,0,0,No
3,867951200000.0,5642828,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No
4,8841186000000.0,5642494,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,1,1,0,0,0,No
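The cell above calls `dropna()` without first checking whether anything is missing. A minimal sketch of that check, assuming a small hypothetical frame (`toy`) in place of the real CSV so the example runs on its own:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the appointments CSV (not real data).
toy = pd.DataFrame({
    "Age": [62, 56, np.nan, 8],
    "Gender": ["F", "M", "F", None],
    "No-show": ["No", "No", "Yes", "No"],
})

# Count missing values per column before deciding to drop anything.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop every row that contains at least one missing value.
cleaned = toy.dropna()
print("rows before:", len(toy), "rows after:", len(cleaned))
```

On the real dataset the same `isna().sum()` call shows whether `dropna()` removes any rows at all.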


# Step 3 - Feature Extraction
Extract the following features:
* Gender [M/F] - OneHot
* Age [X] - Scaling not required for tree-based algorithms
* Scholarship - Already encoded
* Hipertension - Already encoded
* Diabetes - Already encoded
* Alcoholism - Already encoded
* Handcap - Scaling not required for tree-based algorithms
* SMS_received - Already encoded

Note: some column names are misspelled. The misspellings come from the dataset itself, for example:
Hipertension = Hypertension
Handcap = Handicap
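If the misspelled names get in the way, they can be renamed. This is an optional sketch on a hypothetical two-row frame (`toy`); the rest of this notebook keeps the dataset's original spellings.

```python
import pandas as pd

# Hypothetical two-row frame using the dataset's original column spellings.
toy = pd.DataFrame({"Hipertension": [1, 0], "Handcap": [0, 2]})

# rename returns a new frame; the column names in `toy` itself stay unchanged.
renamed = toy.rename(columns={"Hipertension": "Hypertension", "Handcap": "Handicap"})
print(list(renamed.columns))
```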

In [186]:
features = df[["Gender", "Age", "Scholarship", "Hipertension", "Diabetes", "Alcoholism", "Handcap", "SMS_received"]]
target = df[["No-show"]]

# Step 4 - Preprocessing

Perform any needed preprocessing of the chosen features, including:
* scaling,
* encoding, and
* dealing with NaN values.
Hint: use only the preprocessing steps you believe are useful.

In [187]:
# Inspect the unique values of each column to decide which encoding or scaling is needed
print("Unique Gender: ",features["Gender"].unique())
print("Unique Scholarship: ",features["Scholarship"].unique())
print("Unique Hipertension: ",features["Hipertension"].unique())
print("Unique Diabetes: ",features["Diabetes"].unique())
print("Unique Alcoholism: ",features["Alcoholism"].unique())
print("Unique Handcap: ",features["Handcap"].unique())
print("Unique SMS_received: ",features["SMS_received"].unique())
print("Unique No-show: ",target["No-show"].unique())

features = pd.get_dummies(features, columns=["Gender"])

# Check for NaN values in the features dataframe
if features.isna().any().any():
    print("There are NaN values in the features dataframe.")
else:
    print("There are no NaN values in the features dataframe.")

# Check for NaN values in the target dataframe
if target.isna().any().any():
    print("There are NaN values in the target dataframe.")
else:
    print("There are no NaN values in the target dataframe.")

Unique Gender:  ['F' 'M']
Unique Scholarship:  [0 1]
Unique Hipertension:  [1 0]
Unique Diabetes:  [0 1]
Unique Alcoholism:  [0 1]
Unique Handcap:  [0 1 2 3 4]
Unique SMS_received:  [0 1]
Unique No-show:  ['No' 'Yes']
There are no NaN values in the features dataframe.
There are no NaN values in the target dataframe.


In [188]:
target.head(5)

Unnamed: 0,No-show
0,No
1,No
2,No
3,No
4,No


In [189]:
features.head(5)

Unnamed: 0,Age,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,Gender_F,Gender_M
0,62,0,1,0,0,0,0,1,0
1,56,0,0,0,0,0,0,0,1
2,62,0,0,0,0,0,0,1,0
3,8,0,0,0,0,0,0,1,0
4,56,0,1,1,0,0,0,1,0
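The one-hot encoding above produces both `Gender_F` and `Gender_M`, which carry the same information, and it leaves the target as the strings "No" and "Yes". The sketch below shows one possible alternative on hypothetical toy data: `drop_first=True` keeps a single gender column, and `map` turns the labels into integers. Tree-based models work either way, so treat this as an option rather than a required step.

```python
import pandas as pd

# Hypothetical toy data mirroring two of the notebook's columns.
toy_features = pd.DataFrame({"Gender": ["F", "M", "F"], "Age": [62, 56, 8]})
toy_target = pd.Series(["No", "Yes", "No"], name="No-show")

# drop_first=True keeps only Gender_M; dtype=int yields 0/1 instead of booleans.
encoded = pd.get_dummies(toy_features, columns=["Gender"], drop_first=True, dtype=int)

# Map the string labels to integers (1 means the patient did not show up).
y_encoded = toy_target.map({"No": 0, "Yes": 1})

print(encoded)
print(y_encoded.tolist())
```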


# Step 5 - Splitting the Data

Split your data as follows:
* 80% training set
* 10% validation set
* 10% test set

In [190]:
# y holds the target labels ("No"/"Yes") as a NumPy array
y = target["No-show"].values

# Number of feature columns
x_columns = len(features.columns)

# x holds every row and every column of the features dataframe as a NumPy array
x = features.iloc[:, 0:x_columns].values

# Split x and y: 80% for training, 20% held out (test_size=0.2); random_state=0 for consistent results
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)

# Split the held-out 20% in half (test_size=0.5): one half for validation, the other half for testing
X_validate, X_test, y_validate, y_test = train_test_split(X_test, y_test, test_size = 0.5, random_state = 0)

# Verify the split sizes: 80% train, 10% test, 10% validation
print(len(X_train), len(X_test), len(X_validate))

88421 11053 11053
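The split above is purely random. When one class is much rarer than the other, passing `stratify` to `train_test_split` keeps the class ratio the same in every subset. A minimal sketch on hypothetical synthetic data (`X_toy`, `y_toy`) with a 20% "Yes" rate:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: 1000 rows, 20% "Yes" labels (not the real dataset).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array(["Yes"] * 200 + ["No"] * 800)

# First cut: 80% train, 20% held out, preserving the class ratio.
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# Second cut: split the held-out part in half for validation and test.
X_val, X_te, y_val, y_te = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0
)

print(len(X_tr), len(X_val), len(X_te))
print((y_tr == "Yes").mean(), (y_val == "Yes").mean(), (y_te == "Yes").mean())
```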


# Step 6 - Training Tree-based Classifiers

* Use a decision tree classifier model to train your data.
* Choose the best criterion for the decision tree algorithm by trying different values and validating performance on the validation set.
  * Note: choosing the best criterion is an example of hyper-parameter tuning.

* Classification Metrics
  * Print the accuracy score of your final classifier.
  * Print the confusion matrix.

In [191]:
#training using decision tree classifier
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(criterion="entropy", splitter="random").fit(X_train, y_train)
dtc_score = dtc.score(X_test, y_test)

print("DTC score: ", dtc_score)

DTC score:  0.802225640097711


Accuracy score:

In [192]:
# Score the classifier on the validation set, which was not used for training
dtc_accuracy = dtc.score(X_validate, y_validate)
print("Accuracy =", dtc_accuracy)

Accuracy = 0.7950782592961188


Confusion matrix:

In [193]:
from sklearn.metrics import confusion_matrix

predictions = dtc.predict(X_validate)

print("Confusion Matrix: \n",confusion_matrix(y_validate, predictions))

Confusion Matrix: 
 [[8752   81]
 [2184   36]]
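Accuracy alone hides how the tree treats the two classes. The sketch below recomputes a few metrics from the matrix printed above, assuming scikit-learn's default layout (rows are true labels, columns are predictions, labels in sorted order "No", "Yes").

```python
import numpy as np

# Validation confusion matrix reported above.
# Rows: true "No", true "Yes". Columns: predicted "No", predicted "Yes".
cm = np.array([[8752, 81],
               [2184, 36]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()           # share of all predictions that are correct
recall_yes = tp / (tp + fn)               # share of actual no-shows that were caught
precision_yes = tp / (tp + fp)            # share of predicted no-shows that were right
majority_baseline = (tn + fp) / cm.sum()  # accuracy of always predicting "No"

print(f"accuracy={accuracy:.4f} recall_yes={recall_yes:.4f} "
      f"precision_yes={precision_yes:.4f} baseline={majority_baseline:.4f}")
```

The tree catches only 36 of the 2220 actual no-shows, and its accuracy is slightly below what always predicting "No" would achieve, so the accuracy score mostly reflects the class imbalance.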


# Step 7 - Random Forest

Repeat Step 6 with a random forest classifier.
Increase or decrease the number of estimators in the random forest and comment on how the classification metrics change.

In [194]:
#training using random forest classifier
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100,criterion="gini").fit(X_train, y_train)
rfc_score = rfc.score(X_test, y_test)

print("RFC score: ", rfc_score)

RFC score:  0.802225640097711


Accuracy score:

In [195]:
# Score the classifier on the validation set, which was not used for training
rfc_accuracy = rfc.score(X_validate, y_validate)
print("Accuracy =", rfc_accuracy)

Accuracy = 0.7939925811996743


Confusion matrix:

In [196]:
predictions = rfc.predict(X_validate)

print("Confusion Matrix: \n",confusion_matrix(y_validate, predictions))

Confusion Matrix: 
 [[8736   97]
 [2180   40]]
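The cells above train a single forest with `n_estimators=100`. The comparison the task asks for could be sketched as a loop over several forest sizes, recording the validation accuracy and confusion matrix for each. The data is synthetic (`make_classification`) and the sizes 10, 50 and 200 are arbitrary illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data so the sketch runs without the CSV.
X_toy, y_toy = make_classification(n_samples=600, n_features=8, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

# Train one forest per size and collect its validation metrics.
results = {}
for n in (10, 50, 200):
    forest = RandomForestClassifier(n_estimators=n, random_state=0).fit(X_tr, y_tr)
    results[n] = {
        "accuracy": forest.score(X_val, y_val),
        "confusion": confusion_matrix(y_val, forest.predict(X_val)),
    }

for n, metrics in results.items():
    print(n, round(metrics["accuracy"], 4))
    print(metrics["confusion"])
```

Printing the results side by side makes it easier to see whether adding trees changes the metrics or only the training time.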
