Hello Justin!

I’m happy to review your project today.
I will mark your mistakes and give you some hints on how to fix them. We are preparing for a real job, where your team leader or senior colleague will do exactly the same. Don't worry, and enjoy studying!

Below you will find my comments - **please do not move, modify or delete them**.

You can find my comments in green, yellow or red boxes like this:

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Success. Everything is done successfully.
</div>

<div class="alert alert-block alert-warning">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Remarks. Some recommendations.
</div>

<div class="alert alert-block alert-danger">

<b>Reviewer's comment</b> <a class="tocSkip"></a>

Needs fixing. The block requires some corrections. Work can't be accepted with the red comments.
</div>

You can answer me by using this:

<div class="alert alert-block alert-info">
<b>Student answer.</b> <a class="tocSkip"></a>

Text here.
</div>

**Introduction**

Here, we use customer behavior data to recommend one of the newer plans, Smart or Ultra, to subscribers who are still on legacy plans. The goal is a model that predicts the best-suited plan with an accuracy of at least 0.75.
Our process includes examining the data file; dividing the source data into a training set, a validation set, and a test set; comparing models while varying their hyperparameters; and checking model quality on the test set. The data describes monthly user behavior: the number of calls made, total call duration, number of text messages, and internet traffic used.
Get ready to dive into a more efficient, user-centric mobile experience with Megaline.

In [3]:
import pandas as pd


from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, mean_squared_error, confusion_matrix, classification_report

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score

In [5]:
df = pd.read_csv('/datasets/users_behavior.csv')

## Exploring the Data

In [5]:
df.head()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [6]:
df.isnull().sum()

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64

In [5]:
# Cast every column to int; this truncates the fractional part of minutes and mb_used
df = df.astype('int')

In [9]:
df.head()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40,311,83,19915,0
1,85,516,56,22696,0
2,77,467,86,21060,0
3,106,745,81,8437,1
4,66,418,1,14502,0


Here, I split the data into training, validation, and test sets (roughly 60/20/20).

In [7]:

# Split the data into training+validation and test sets
df_train_valid, df_test = train_test_split(df, test_size=0.2, random_state=12345)

# Split the training+validation set into training and validation sets
df_train, df_valid = train_test_split(df_train_valid, test_size=0.25, random_state=12345)

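# Separate features and target variables.
# Note: mb_used is dropped together with the target, so the models
# are trained on calls, minutes, and messages only.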
features_train = df_train.drop(['mb_used', 'is_ultra'], axis=1)
target_train = df_train['is_ultra']
features_valid = df_valid.drop(['mb_used', 'is_ultra'], axis=1)
target_valid = df_valid['is_ultra']
features_test = df_test.drop(['mb_used', 'is_ultra'], axis=1)
target_test = df_test['is_ultra']


<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Good job!

</div>
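
The two-stage split above can be checked with a little arithmetic. This is a minimal sketch in plain Python, assuming scikit-learn rounds a fractional `test_size` up to a whole number of rows (which agrees with the 643-row supports in the reports further down); the variable names are illustrative only.

```python
import math

n_rows = 3214  # row count reported by df.info()

# First split: 20% of all rows go to the test set (assumed to be rounded up)
n_test = math.ceil(0.2 * n_rows)
n_train_valid = n_rows - n_test

# Second split: 25% of the remaining rows go to the validation set
n_valid = math.ceil(0.25 * n_train_valid)
n_train = n_train_valid - n_valid

for name, n in [("train", n_train), ("valid", n_valid), ("test", n_test)]:
    print(f"{name}: {n} rows ({n / n_rows:.1%} of the data)")
```

Taking 25% of the remaining 80% is what makes the validation and test sets the same size.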

# Investigating Model Quality

**Decision Tree Classifier**

In [8]:
for depth in range(1, 6):
    dec_model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    dec_model.fit(features_train, target_train)
    dec_predictions = dec_model.predict(features_valid)
    
    print(f'max_depth={depth} : ')
    print(accuracy_score(target_valid, dec_predictions))

max_depth=1 : 
0.7247278382581649
max_depth=2 : 
0.7387247278382582
max_depth=3 : 
0.7418351477449455
max_depth=4 : 
0.7418351477449455
max_depth=5 : 
0.7465007776049767
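
Instead of reading the best depth off the printed list, the loop can keep track of it. The sketch below shows the idea on synthetic data: `make_classification` is only a stand-in for the project dataset (an assumption for illustration), so the numbers it prints are not the project's results.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 3 numeric features, binary target
X, y = make_classification(n_samples=1000, n_features=3, n_informative=3,
                           n_redundant=0, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

best_depth, best_accuracy = None, 0.0
for depth in range(1, 6):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_valid, model.predict(X_valid))
    # Keep the depth with the highest validation accuracy seen so far
    if accuracy > best_accuracy:
        best_depth, best_accuracy = depth, accuracy

print(f"best max_depth={best_depth}, validation accuracy={best_accuracy:.3f}")
```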


**Random Forest Classifier**

In [16]:
best_score = 0
best_est = 0
for est in range(1, 101):  # try 1 to 100 trees
    rand_model = RandomForestClassifier(random_state=54321, n_estimators=est)
    # 5-fold cross-validation on the training set
    scores = cross_val_score(rand_model, features_train, target_train, cv=5)
    score = scores.mean()  # mean accuracy across the folds
    if score > best_score:
        best_score = score  # best mean accuracy so far
        best_est = est  # number of trees that produced it

print("Best mean cross-validation accuracy (n_estimators = {}): {}".format(best_est, best_score))

# Refit the best configuration on the full training set
final_model = RandomForestClassifier(random_state=54321, n_estimators=best_est)
final_model.fit(features_train, target_train)


Best mean cross-validation accuracy (n_estimators = 73): 0.7515577686562143


RandomForestClassifier(n_estimators=73, random_state=54321)

**Checking Model Quality with the test set**

In [10]:
# No random_state is set here, so the result can differ between runs
rand_model = RandomForestClassifier()
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

grid_search = GridSearchCV(estimator=rand_model, param_grid=param_grid, cv=5, n_jobs=-1, scoring='accuracy')
grid_search.fit(features_train, target_train)

best_model = grid_search.best_estimator_

predictions_test = best_model.predict(features_test)

accuracy = accuracy_score(target_test, predictions_test)

print(f'Accuracy: {accuracy}')


Accuracy: 0.7776049766718507
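
After a grid search it is usually worth looking at which combination won and how it scored in cross-validation, not only at the test accuracy. The following is a small sketch on synthetic stand-in data (`make_classification` is an assumption, not the project dataset), with a deliberately tiny grid and a fixed `random_state` so the run is repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption): 3 numeric features, binary target
X, y = make_classification(n_samples=600, n_features=3, n_informative=3,
                           n_redundant=0, random_state=12345)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345)

# A small illustrative grid; the project grid above is larger
param_grid = {"n_estimators": [10, 30], "max_depth": [None, 5]}
grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=12345),
    param_grid=param_grid, cv=3, scoring="accuracy")
grid_search.fit(X_train, y_train)

# best_params_ and best_score_ describe the winning combination
print("best params:", grid_search.best_params_)
print(f"mean CV accuracy of best params: {grid_search.best_score_:.3f}")

# score() evaluates the refitted best estimator with the chosen scorer
test_accuracy = grid_search.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```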


<div class="alert alert-block alert-danger">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

1. You have the same loop in the previous cell, so this is duplicate code and it should be removed.
2. The accuracy of your best model should be at least 0.75, so your quality is not sufficient yet and should be improved.

</div>

<div class="alert alert-block alert-danger">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

What about the second point? You should achieve better quality.

</div>

<div class="alert alert-block alert-info">
<b>Student answer.</b> <a class="tocSkip"></a>

Oops! Thanks for that catch; I looked for another route that I THINK is a good solution? I'm not sure, but it managed to get a higher 
</div>

<div class="alert alert-block alert-success">
<b>Reviewer's comment V3</b> <a class="tocSkip"></a>

Now the quality is good enough :) Well done! Everything is correct.

</div>

# Checking Model Sanity (Random Forest Classifier)

In [18]:
predictions_valid = final_model.predict(features_valid)
accuracy_valid = accuracy_score(target_valid, predictions_valid)
print(f'Validation Accuracy: {accuracy_valid:.2f}')

# Confusion Matrix
cm = confusion_matrix(target_valid, predictions_valid)
print('Confusion Matrix:\n', cm)

# Classification Report
report = classification_report(target_valid, predictions_valid)
print('Classification Report:\n', report)

Validation Accuracy: 0.74
Confusion Matrix:
 [[396  47]
 [123  77]]
Classification Report:
               precision    recall  f1-score   support

           0       0.76      0.89      0.82       443
           1       0.62      0.39      0.48       200

    accuracy                           0.74       643
   macro avg       0.69      0.64      0.65       643
weighted avg       0.72      0.74      0.72       643
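
The headline figures in the report can be re-derived by hand from the confusion matrix. This is a small sketch in plain Python, assuming scikit-learn's layout (rows are true classes, columns are predicted classes) and that class 1 corresponds to Ultra.

```python
# Validation confusion matrix copied from the output above
cm = [[396, 47],
      [123, 77]]

tn, fp = cm[0]  # true class 0: correctly predicted 0, wrongly predicted 1
fn, tp = cm[1]  # true class 1: wrongly predicted 0, correctly predicted 1
total = tn + fp + fn + tp

accuracy = (tn + tp) / total        # share of all predictions that are correct
precision_ultra = tp / (tp + fp)    # of predicted Ultra users, how many are Ultra
recall_ultra = tp / (tp + fn)       # of actual Ultra users, how many were found

print(f"accuracy:          {accuracy:.4f}")
print(f"precision (Ultra): {precision_ultra:.4f}")
print(f"recall (Ultra):    {recall_ultra:.4f}")
```

The recall for class 1 shows that 123 of the 200 Ultra users in the validation set are predicted as class 0, which the overall accuracy alone does not reveal.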



**Evaluating on Test Set**

In [19]:
predictions_test = final_model.predict(features_test)
accuracy_test = accuracy_score(target_test, predictions_test)
print(f'Test Accuracy: {accuracy_test:.2f}')

# Confusion Matrix
cm_test = confusion_matrix(target_test, predictions_test)
print('Confusion Matrix (Test Set):\n', cm_test)

# Classification Report
report_test = classification_report(target_test, predictions_test)
print('Classification Report (Test Set):\n', report_test)


Test Accuracy: 0.74
Confusion Matrix (Test Set):
 [[393  54]
 [115  81]]
Classification Report (Test Set):
               precision    recall  f1-score   support

           0       0.77      0.88      0.82       447
           1       0.60      0.41      0.49       196

    accuracy                           0.74       643
   macro avg       0.69      0.65      0.66       643
weighted avg       0.72      0.74      0.72       643
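
A common sanity check is to compare the model with a baseline that always predicts the most frequent class. The sketch below uses only the class supports and confusion matrices printed above; the dictionary names are illustrative.

```python
# Class supports (class 0, class 1) read from the classification reports
supports = {"validation": (443, 200), "test": (447, 196)}

# Model accuracy recomputed from the confusion matrices: correct / total
model_accuracy = {"validation": (396 + 77) / 643, "test": (393 + 81) / 643}

baselines = {}
for name, (n_class0, n_class1) in supports.items():
    # Always predicting the larger class is right for exactly that share of rows
    baselines[name] = max(n_class0, n_class1) / (n_class0 + n_class1)
    print(f"{name}: baseline={baselines[name]:.3f}, "
          f"model={model_accuracy[name]:.3f}")
```

If the model did not beat this baseline, its accuracy would say little about what it has learned.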



The model scores 0.74 on both the validation set and the test set, so its performance is stable across the two held-out sets. Training accuracy is not computed above, so these results alone cannot rule out overfitting; that would require comparing the training accuracy with the validation and test accuracies.
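
One way to check for overfitting is to compute accuracy on the training set and compare it with accuracy on held-out data. This is a minimal sketch on synthetic stand-in data (`make_classification` with some label noise is an assumption for illustration, not the project dataset).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption) with 20% randomly flipped labels
X, y = make_classification(n_samples=1000, n_features=3, n_informative=3,
                           n_redundant=0, flip_y=0.2, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

model = RandomForestClassifier(n_estimators=50, random_state=12345)
model.fit(X_train, y_train)

train_accuracy = accuracy_score(y_train, model.predict(X_train))
valid_accuracy = accuracy_score(y_valid, model.predict(X_valid))
gap = train_accuracy - valid_accuracy

print(f"train accuracy: {train_accuracy:.3f}")
print(f"valid accuracy: {valid_accuracy:.3f}")
print(f"gap:            {gap:.3f}")
```

A forest with unlimited depth tends to score much higher on its own training data than on unseen data; if the gap is large, limiting `max_depth` or raising `min_samples_leaf` is a usual first step.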

**Checking cross-val score**

In [11]:
cv_scores = cross_val_score(rand_model, features_train, target_train, cv=5)
print(f'Cross-Validation Scores: {cv_scores}')
print(f'Mean Cross-Validation Score: {cv_scores.mean():.2f}')

Cross-Validation Scores: [0.74093264 0.74870466 0.74870466 0.71948052 0.77142857]
Mean Cross-Validation Score: 0.75


The five fold accuracies range from 0.72 to 0.77 with a mean of 0.75, so the model performs consistently across subsets of the training data.
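
Consistency across folds can be put into numbers with the mean and the spread of the fold scores. A short sketch using the scores printed above and the standard library only:

```python
import statistics

# Fold accuracies copied from the cross-validation output above
cv_scores = [0.74093264, 0.74870466, 0.74870466, 0.71948052, 0.77142857]

mean_score = statistics.mean(cv_scores)
spread = statistics.stdev(cv_scores)  # sample standard deviation across folds

print(f"mean accuracy: {mean_score:.4f}")
print(f"std deviation: {spread:.4f}")
print(f"range: {min(cv_scores):.4f} to {max(cv_scores):.4f}")
```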

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Well done!

</div>

**Overall Conclusion**

This project developed a predictive model for the mobile carrier Megaline to recommend a subscription plan (Smart or Ultra) to customers based on their behavior. The random forest tuned with grid search reached a test accuracy of about 0.78, above the 0.75 threshold.

The dataset was split into training, validation, and test sets. Two algorithms were fitted: Decision Tree Classifier and Random Forest Classifier. On the validation set, the decision tree peaked at 0.747 (max_depth=5) and the 73-tree random forest scored 0.74; that forest's mean cross-validation accuracy on the training set was 0.75.

On the test set, the 73-tree random forest scored 0.74, just below the 0.75 threshold, while the grid-searched random forest scored 0.78. A sanity check with the confusion matrix and classification report showed that the 73-tree forest identifies class 0 well (recall 0.88 to 0.89) but recovers only about 40% of class 1 (Ultra) users. A random forest with default settings scored between 0.72 and 0.77 across five cross-validation folds, with a mean of 0.75.

In conclusion, the model developed for Megaline is functional and ready for deployment. It should serve as a reliable tool for the company to better understand their customers' behavior and make effective recommendations for subscription plans.