<div style="border:solid green 2px; padding: 20px"> <h1 style="color:green; margin-bottom:20px">Reviewer's comment v1</h1>

Hello Sang!

I'm happy to review your project today  🙌

You can find my comments under the heading **«Review»**. I will categorize my comments in green, blue or red boxes like this:

<div class="alert alert-success">
    <b>Success:</b> if everything is done successfully
</div>
<div class="alert alert-warning">
    <b>Remarks:</b> if I can give some recommendations or ways to improve the project
</div>
<div class="alert alert-danger">
    <b>Needs fixing:</b> if the block requires some corrections. Work can't be accepted with the red comments
</div>

Please don't remove my comments. If you have any questions, don't hesitate to respond to my comments in a different section.
<div class="alert alert-info"> <b>Student comments: </div>    

<div style="border:solid green 2px; padding: 20px">
<b>Reviewer's comment v1:</b>
    
<b>Overall Feedback</b> 
    

Hello Sang,

Another project successfully completed - well done! 🏆 Your consistent effort and progress are truly commendable.

Our team is here to help you keep pushing forward and honing your skills as you advance through the program.

You’ll find general comments below in the notebook in the `Reviewer's comment v1:` blocks.

**What Was Great:**

- Data Loading and Preparation: You successfully loaded the dataset and correctly separated the features and target variable. This is a crucial first step, and you handled it appropriately.
- Model Selection: You chose a variety of classification models—Decision Tree, Random Forest, and Logistic Regression—which is excellent for comparing different algorithms' performance on your data.
- Model Training and Evaluation: Using a loop to train and evaluate each model is an efficient approach. You correctly used the accuracy_score metric to assess the models and identified the Random Forest classifier as the best-performing model.
- Test Set Evaluation: You validated the performance of your selected model on a test set, ensuring that the model generalizes well to unseen data. The test accuracy being close to the validation accuracy is a good sign.
    
**Tips for Future Projects:**
    
- Consider using GridSearchCV or RandomizedSearchCV for more systematic hyperparameter tuning. This can enhance efficiency and ensure you explore a wider range of parameters.
- To improve model evaluation, incorporate additional metrics like F1-score, precision, recall, or ROC-AUC. These provide a more comprehensive view of your models' performance, especially with imbalanced datasets.
- Add visualization for feature importance and add learning curves to diagnose overfitting/underfitting. 
- Create separate functions for data preprocessing, model training, and evaluation.
    
Congratulations again on your accomplishment! Each project you complete adds to your growing expertise, and it’s exciting to see you make such great strides. Keep up the great work! 🎯

<div class="alert alert-warning">
<b>Reviewer's comment v1:</b>
    
It is always helpful for readers to have additional information about project tasks. It gives them an overview of what they will achieve in this project.

# Importing

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
import numpy as np

# Loading Dataset

In [2]:
df = pd.read_csv('/datasets/users_behavior.csv')

In [3]:
df

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.90,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0
...,...,...,...,...,...
3209,122.0,910.98,20.0,35124.90,1
3210,25.0,190.36,0.0,3275.61,0
3211,97.0,634.44,70.0,13974.06,0
3212,64.0,462.32,90.0,31239.78,0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [5]:
# checking for duplicates
print(df.duplicated().value_counts())

False    3214
dtype: int64


Observations

Data types are appropriate, no missing values, and no duplicate rows. Ready to move on!

<div class="alert alert-success">
<b>Reviewer's comment v1:</b>
    
Well done! Data have been successfully loaded and inspected.

# Splitting into Training, Validation, Test Sets

In [6]:
# using train_test_split to split our data into 60% training set and 40% temp set

training, temp = train_test_split(df, test_size=0.4, random_state=1)

# using train_test_split again to split our temp dataset 50%, leaving a total of 20% valid and 20% test

validation, testing = train_test_split(temp, test_size=0.5, random_state=1)

<div class="alert alert-warning">
<b>Reviewer's comment v1:</b>
    
Consider using stratified splits (`stratify` parameter) to ensure that class distributions are similar across training, validation, and test sets.

https://scikit-learn.org/1.5/modules/generated/sklearn.model_selection.train_test_split.html

In [7]:
print(len(training))

display(training.head())

1928


Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
1587,53.0,376.78,44.0,17334.12,0
11,45.0,344.32,13.0,19898.81,0
353,68.0,493.0,29.0,20021.73,0
350,65.0,423.06,40.0,18625.97,0
2708,45.0,296.26,75.0,13121.85,0


In [8]:
print(len(validation))

display(validation.head())

643


Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
2817,12.0,86.62,22.0,36628.85,1
248,127.0,748.44,0.0,19369.15,0
1171,49.0,344.92,17.0,23383.4,0
1935,68.0,556.88,63.0,11114.1,0
2291,0.0,0.0,28.0,11864.26,1


In [9]:
print(len(testing))

display(testing.head())

643


Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
1759,51.0,328.88,24.0,20511.93,0
2925,87.0,500.78,63.0,26115.19,0
1808,53.0,370.03,0.0,32581.16,1
615,75.0,486.63,66.0,24650.84,0
1944,28.0,216.05,41.0,11946.55,0


Observations

I was able to successfully split the original dataset into 60% training, 20% validation and 20% test.

# Model Training

Splitting datasets into targets and features

In [10]:
# splitting each dataset into features and targets

training_features = training.drop('is_ultra', axis=1)
training_target = training['is_ultra']

validation_features = validation.drop('is_ultra', axis=1)
validation_target = validation['is_ultra']

testing_features = testing.drop('is_ultra', axis=1)
testing_target = testing['is_ultra']

<div class="alert alert-warning">
<b>Reviewer's comment v1:</b>

As a second approach you could print the sizes of dataframes using `shape` and `f-strings`

```
f'Size of the dataframe {testing.shape}'
```

# Training and evaluating various models

Decision Tree

Evaluating the hyperparameters of the decision tree

In [11]:
for depth in range(1, 10):
    model_tree = DecisionTreeClassifier(random_state=1, max_depth=depth) # testing various max_depth values
    model_tree.fit(training_features, training_target) # training model on training set

    predictions_valid = model_tree.predict(validation_features) # running predictions with validation set

    print("max_depth =", depth, ": ", end='')
    print(accuracy_score(validation_target, predictions_valid)) # calculating accuracy of validation predictions vs target

max_depth = 1 : 0.71850699844479
max_depth = 2 : 0.7558320373250389
max_depth = 3 : 0.7713841368584758
max_depth = 4 : 0.7682737169517885
max_depth = 5 : 0.7698289269051322
max_depth = 6 : 0.7713841368584758
max_depth = 7 : 0.7776049766718507
max_depth = 8 : 0.7869362363919129
max_depth = 9 : 0.7962674961119751


Random Forest

Evaluating the hyperparameters of the decision tree

In [12]:
best_score = 0
best_est = 0

for est in range(10, 51, 10):
    for depth in range (1, 10):
        model_forest = RandomForestClassifier(random_state=1,
                                              n_estimators=est,
                                              max_depth=depth) # set number of trees/max depth
        model_forest.fit(training_features, training_target) # train model on training set
        score = model_forest.score(validation_features, validation_target) # calculate accuracy score on validation set
        if score > best_score:
            best_score = score # save best accuracy score on validation set
            best_est = est # save number of estimators corresponding to best accuracy score
            best_depth = depth # save the value of max_depth corresponding to best accuracy score

print("Accuracy of the best model on the validation set (n_estimators = {}, max_depth = {}): {}".format(best_est,
                                                                                                        best_depth,
                                                                                                        best_score))

Accuracy of the best model on the validation set (n_estimators = 10, max_depth = 9): 0.7931570762052877


Logistic Regression

Evaluating the accuracy of logistic regression


In [13]:
model_lr = LogisticRegression(random_state=54321, solver='liblinear')
model_lr.fit(training_features, training_target)  # train model on training set
score_train = model_lr.score(training_features, training_target)
score_valid = model_lr.score(validation_features, validation_target)

print(
    "Accuracy of the logistic regression model on the training set:",
    score_train,
)
print(
    "Accuracy of the logistic regression model on the validation set:",
    score_valid,
)

Accuracy of the logistic regression model on the training set: 0.7240663900414938
Accuracy of the logistic regression model on the validation set: 0.6889580093312597


<div class="alert alert-success">
<b>Reviewer's comment v1:</b>
    
Everything is correct here! Great that you've managed to check multiple models. 

Some possible minor improvements: 

- Besides accuracy, consider evaluating models using additional metrics like F1-score, precision, recall, or the ROC-AUC score for a more holistic view of performance, especially if the class distribution is imbalanced.
- Consider using GridSearchCV or RandomizedSearchCV from sklearn.model_selection for a more systematic hyperparameter search.

Observations

After tuning the hyperparameters and evaluating each model, the most accurate is the Decision Tree, with a max_depth of 9. The Random Forest evaluated to be very close in accuracy, but it takes longer to run, so the Decision Tree wins in both accuracy and speed this time. We will be moving forward with the Decision Tree with a max_depth of 9.

In [14]:
# using the hyperparameters from the Decision Tree tuning above
final_model = DecisionTreeClassifier(random_state=1, max_depth=9)
final_model.fit(training_features, training_target)

score_test = final_model.score(testing_features, testing_target)

print('Final Model Accuracy Score:', score)

Final Model Accuracy Score: 0.7931570762052877


<div class="alert alert-success">
<b>Reviewer's comment v1:</b>
    
Great! Well above the required threshold. 

Observations

The final model (decision tree with max_depth of 9) ends with an accuracy score of 0.79, pretty good!

Sanity Check

Final step is to run through a sanity check with the model, so I can be sure that the model actually learned something, and performs better than something like randomly assigning a customer a plan.

In [15]:
# new series, sanity, will be equivalent to randomly selecting a plan for customers
sanity = np.random.choice([0, 1], size=len(testing_target))

In [16]:
sanity_score = final_model.score(testing_features, sanity)

print('Sanity Check score:', sanity_score)

Sanity Check score: 0.48055987558320373


<div class="alert alert-warning">
<b>Reviewer's comment v1:</b>
    
**Additional task: sanity check the model.** 
 
The use of `DummyClassifier` is a good practice for baseline performance and finalizing sanity check. You could read about it [here](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html). In our case, we have a very skewed distribution of users across the two plans. As you can see, only 30% are enrolled in the Ultra plan, therefore it could be usefull to use a particular strategy (e.g., `most_frequent`) and comparing it against the chosen model’s performance.
    
```
from sklearn.dummy import DummyClassifier
# Initialize the DummyClassifier to predict the most frequent class
dummy_clf = DummyClassifier(strategy="most_frequent", random_state=0)
# Fit the dummy classifier on the training data
...
```

# Conclusions

Throughout this project, I:

Loaded in the dataset of user behavior
Made sure the data was appropriate and ready to be utilized in machine learning models
Split the data into 3 sets, for training, validation and testing, at at 3:1:1 ratio
Began training various models and tuning hyperparameters to figure out which model was the most accurate
Trained our final model and tested it using the test dataset
Performed a sanity check on our model
I chose to go with the Decision Tree Model, because it had the highest accuracy, and it is also a faster model than the Random Forest. Logistic Regression did not meet the accuracy threshold for the project, so I ignored it.

I performed a sanity check by testing the model against a random assignment of plans to customers. This performed roughly as expected, right around 50%.

My model's performance at 0.793 indicated that there are indeed learned patterns in my model, as it performs significantly better than random assignemtn of plans to customers.

As I continue to advance through this course, I hope to be able to return to this project with more knowledge to improve the performance of this model (or create a new one altogether).

# Experimentation

Using GridSearchCV to evaluate more hyperparameters of random forest

In [23]:
from sklearn.model_selection import GridSearchCV

In [24]:
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

In [25]:
rf_model = RandomForestClassifier()
grid_search = GridSearchCV(rf_model, param_grid, cv=5)

In [None]:
grid_search.fit(training_features, training_target)

In [None]:
best_params = grid_search.best_params_
best_est = grid_search.best_estimator_

In [None]:
best_model_score = grid_search.score(testing_features, testing_target)

print(best_model_score)

<div class="alert alert-success">
<b>Reviewer's comment v1:</b>
    

Great job on your overall conclusions and recommendations!  Your recommendations are well-thought and could be very valuable to the business.