# Sprint 7 Project: Intro to Machine Learning
----

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

## 1) Open and Confirm Quality of Data
---

In [2]:
df = pd.read_csv('users_behavior.csv')

In [3]:
df.head()

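|   | calls | minutes | messages | mb_used | is_ultra |
|---|-------|---------|----------|---------|----------|
| 0 | 40.0  | 311.90  | 83.0     | 19915.42 | 0 |
| 1 | 85.0  | 516.75  | 56.0     | 22696.96 | 0 |
| 2 | 77.0  | 467.66  | 86.0     | 21060.45 | 0 |
| 3 | 106.0 | 745.53  | 81.0     | 8437.39  | 1 |
| 4 | 66.0  | 418.74  | 1.0      | 14502.75 | 0 |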


**is_ultra: 1 = Ultra, 0 = Smart**

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [5]:
df.isna().sum()

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64

A quick look confirms that the data is clean: all 3,214 rows are non-null in every column, the four feature columns (calls, minutes, messages, mb_used) are float64, and the target column is_ultra is int64.
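These checks can also be gathered into one small report. The sketch below is only illustrative: `quality_report` is a hypothetical helper that is not part of this notebook, and `sample` holds just the five rows shown by `df.head()` rather than the full file.

```python
import pandas as pd

# The five rows shown by df.head(), used as a small stand-in for the full data
sample = pd.DataFrame({
    "calls": [40.0, 85.0, 77.0, 106.0, 66.0],
    "minutes": [311.9, 516.75, 467.66, 745.53, 418.74],
    "messages": [83.0, 56.0, 86.0, 81.0, 1.0],
    "mb_used": [19915.42, 22696.96, 21060.45, 8437.39, 14502.75],
    "is_ultra": [0, 0, 0, 1, 0],
})


def quality_report(frame):
    """Hypothetical helper: collect a few basic data-quality counts."""
    numeric = frame.select_dtypes("number")
    return {
        "missing_values": int(frame.isna().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
        "negative_values": int((numeric < 0).sum().sum()),
        "target_values": sorted(frame["is_ultra"].unique().tolist()),
    }


report = quality_report(sample)
print(report)
```

Calling `quality_report(df)` on the full DataFrame would run the same checks on all rows.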

## 2) Split Data: Train, Validate, Test
---

I split the data 60-20-20 (train-validate-test) in two steps: first hold out 40% of the rows, then split that holdout in half.

In [6]:
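# First hold out 40% of the rows, then split that holdout in half (20% validation, 20% test)
train, holdout = train_test_split(df, test_size=0.4, random_state=1)
valid, test = train_test_split(holdout, test_size=0.5, random_state=2)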

In [7]:
x_train = train.drop('is_ultra', axis=1)
y_train = train['is_ultra']

x_val = valid.drop('is_ultra', axis=1)
y_val = valid['is_ultra']

x_test = test.drop('is_ultra', axis=1)
y_test = test['is_ultra']

# confirming size of each set
lis = [x_train, y_train, x_val, y_val, x_test, y_test]
for x in lis:
    print(x.shape)

(1928, 4)
(1928,)
(643, 4)
(643,)
(643, 4)
(643,)
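These shapes follow from the two-step split: holding out 40% and then halving the holdout leaves 60-20-20. The sketch below repeats the split on made-up data with the same row count. The `stratify` argument is an optional extra that the cell above does not use; it keeps the class ratio roughly the same in every part. All `demo_*` names are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for the real data: 3214 rows with a binary target
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "feature": rng.normal(size=3214),
    "is_ultra": rng.integers(0, 2, size=3214),
})

# Step 1: hold out 40% of the rows, leaving 60% for training
demo_train, demo_holdout = train_test_split(
    demo, test_size=0.4, random_state=1, stratify=demo["is_ultra"]
)
# Step 2: halve the holdout into validation and test (20% each)
demo_valid, demo_test = train_test_split(
    demo_holdout, test_size=0.5, random_state=2, stratify=demo_holdout["is_ultra"]
)

sizes = [len(part) for part in (demo_train, demo_valid, demo_test)]
print(sizes)
```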


## 3) Investigate Different Models and Hyperparameters
---

### Decision Tree Classifier

In [8]:
# started with range(1,11,2), then narrowed down to range(6,9)

for d in range(6,9):
    dtc_model = DecisionTreeClassifier(max_depth=d, random_state=123)
    dtc_model.fit(x_train, y_train)
    dtc_pred = dtc_model.predict(x_val)
    dtc_acc = accuracy_score(y_val, dtc_pred)
    print('depth =', d, '-- accuracy =', dtc_acc)

depth = 6 -- accuracy = 0.7729393468118196
depth = 7 -- accuracy = 0.7838258164852255
depth = 8 -- accuracy = 0.776049766718507


**Best Decision Tree Classifier:**
- max_depth = 7
- accuracy = 0.784

### Random Forest Classifier

In [9]:
# started with range(1,22,4), then narrowed down to range(15,20)

for n in range(15,20):
    rfc_model = RandomForestClassifier(random_state=12345, n_estimators=n)
    rfc_model.fit(x_train, y_train)
    rfc_pred = rfc_model.predict(x_val)
    rfc_acc = accuracy_score(y_val, rfc_pred)
    print('n_estimators =', n, '-- accuracy =', rfc_acc)

n_estimators = 15 -- accuracy = 0.7791601866251944
n_estimators = 16 -- accuracy = 0.7807153965785381
n_estimators = 17 -- accuracy = 0.7884914463452566
n_estimators = 18 -- accuracy = 0.7822706065318819
n_estimators = 19 -- accuracy = 0.7869362363919129


**Best Random Forest Classifier:**
- n_estimators = 17
- accuracy = 0.788
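Instead of reading the best value off the printed list, the loop can record each score and pick the maximum itself. This is a minimal sketch on made-up data from `make_classification`, so its scores will not match the ones above; `X_fit`, `X_check`, and `scores` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Made-up data standing in for the real features and target
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_fit, X_check, y_fit, y_check = train_test_split(X, y, test_size=0.25, random_state=0)

# Score every candidate on the validation part and remember each result
scores = {}
for n in range(15, 20):
    model = RandomForestClassifier(random_state=12345, n_estimators=n)
    model.fit(X_fit, y_fit)
    scores[n] = accuracy_score(y_check, model.predict(X_check))

# Pick the candidate with the highest validation accuracy
best_n = max(scores, key=scores.get)
print('best n_estimators =', best_n, '-- accuracy =', round(scores[best_n], 3))
```

If two candidates tie, `max` keeps the first one it sees, which here is the smaller forest.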

### Logistic Regression

In [10]:
# No hyperparameter tuning done here

logreg_model = LogisticRegression(random_state=12345, solver='liblinear', max_iter=100)
logreg_model.fit(x_train, y_train)
logreg_pred = logreg_model.predict(x_val)
logreg_acc = accuracy_score(y_val, logreg_pred)
print('Model Accuracy =', logreg_acc)

Model Accuracy = 0.6951788491446346


### Model Conclusion

- The models compared were decision tree, random forest, and logistic regression, because this is a classification problem.
- For each model, I tuned one parameter and narrowed its range toward the most accurate value:
    - Decision Tree used max_depth
    - Random Forest used n_estimators
    - Logistic Regression used max_iter (I tried many values, but it never affected the accuracy)
- The most accurate model was the **Random Forest Classifier with n_estimators=17 and accuracy=0.788**
- The Decision Tree was close behind at 0.784
- The Logistic Regression was much lower than the others at 0.695
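One plausible reason max_iter had no effect (an assumption, not verified on this dataset) is that the solver converges well before the iteration limit, so raising the limit changes nothing. The fitted model's `n_iter_` attribute reports how many iterations were actually used. The sketch below checks this on made-up data, not on users_behavior.csv.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Made-up data; the real features may behave differently
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

results = {}
for max_iter in (100, 1000):
    model = LogisticRegression(random_state=12345, solver='liblinear', max_iter=max_iter)
    model.fit(X, y)
    # n_iter_ holds the iterations the solver actually ran
    results[max_iter] = (int(model.n_iter_.max()), model.score(X, y))

for max_iter, (used, acc) in results.items():
    print('max_iter =', max_iter, '-- iterations used =', used, '-- accuracy =', acc)
```

If the iterations used stay below the smaller limit, both fits are the same model and the accuracy cannot differ.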

## 4) Check Model Quality with Test Data
---

In [11]:
# Adding the train and validation data together to train the model on more data before seeing the test set

xx = [x_train, x_val]
x_full = pd.concat(xx)

yy = [y_train, y_val]
y_full = pd.concat(yy)

print(x_full.shape)
print(y_full.shape)

(2571, 4)
(2571,)


In [12]:
# Final Model based on validation

final_model = RandomForestClassifier(random_state=12345, n_estimators=17)
final_model.fit(x_full, y_full)

final_pred = final_model.predict(x_test)

final_acc = accuracy_score(y_test, final_pred)

print('Final Model Accuracy: {:.2f}%'.format(final_acc*100))

Final Model Accuracy: 79.00%


My final model reached **79% accuracy** on the test data, meaning it correctly predicted 508 of the 643 test rows.
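The count of correct predictions comes straight from the definition accuracy = correct / total. A quick check of that arithmetic, using only the test-set size and the rounded accuracy reported above:

```python
n_test = 643              # rows in the test set, from the shapes printed in section 2
reported_accuracy = 0.79  # final accuracy, rounded

# accuracy = correct / total, so correct = accuracy * total (rounded to a whole row)
n_correct = round(reported_accuracy * n_test)
recomputed = n_correct / n_test
print(n_correct, 'correct out of', n_test, '=', '{:.2f}%'.format(recomputed * 100))
```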

## 5) Sanity Check
---

- I wanted the model to beat at least 50% accuracy, since this is a binary classification problem where random guessing would be right about half the time.
- The project goal was to get better than 75% accuracy.
- With my model scoring 79%, it surpassed both thresholds.
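Random guessing is not the only trivial baseline. When the classes are imbalanced, always predicting the more common class can score well above 50%. The sketch below shows the idea with scikit-learn's `DummyClassifier` on made-up labels with a 70/30 split; the real class balance of is_ultra is not shown in this notebook, so the 70/30 ratio is purely an assumption for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels with a 70/30 class split, for illustration only
y_demo = np.array([0] * 70 + [1] * 30)
X_demo = np.zeros((len(y_demo), 1))  # the dummy model ignores the features

# Always predict the most common class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_demo, y_demo)
baseline_acc = accuracy_score(y_demo, baseline.predict(X_demo))
print('Baseline accuracy =', baseline_acc)
```

Fitting the same baseline on `x_full, y_full` and scoring it on `x_test, y_test` would give the number the final model really has to beat.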

## 6) Conclusion
---

Overall, I compared three simple classification models and chose the best one based on accuracy on the validation data. Each model was tuned with a single hyperparameter, which I narrowed down until the best accuracy was found. After finding the best model (Random Forest), I built the final model with the best parameter value, trained it on the combined training and validation sets, and tested it on the test data. It reached 79% accuracy, surpassing both the sanity check threshold (50%) and the project threshold (75%). I am happy with my final model accuracy, but I am sure there are other parameters and model types I do not yet know of that, in some combination, would perform better.