# Data Prediction

The objective of this notebook is to predict active vs inactive customers. 

### Load the data
We start with our imports:

In [1]:
# Import libraries
import pandas as pd
import seaborn as sns


Now we'll read our dataset with pandas:

In [2]:
# Read data
data = pd.read_csv("../data/data_prediction.csv")
data.head(5)

Unnamed: 0,ID,Subscription,Refill,Interval,Price per interval,PaymentMethod,Products,Transactions,Revenue,DaysSinceLastOrder,SubscriptionLifetime,Active,SubscriptionLifetimeBin
0,2092329948149575290,1,0,3,5.0,ideal,Brush,1,54.0,217,217,0,High
1,12849545117712085541,1,0,2,10.0,ideal,Brush,1,98.0,66,66,0,Medium
2,8144808881881603173,1,0,2,5.0,ideal,Brush,1,69.0,66,66,0,Medium
3,14457143150027122736,1,0,3,10.0,ideal,Brush,1,98.0,66,66,0,Medium
4,9143720643862541019,1,0,3,10.0,ideal,Brush,1,98.0,66,66,0,Medium


In [3]:
# Drop unnecessary columns - Data
drop_cols = ["ID"]
data.drop(drop_cols, axis=1, inplace=True)

Let's start by looking at the shape of our data.

In [4]:
# Print shape of data
print(data.shape)

(945, 12)


In [5]:
# Create dummies
data = pd.get_dummies(data, drop_first=True)
data.shape

(945, 15)
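As a minimal sketch of what `drop_first=True` does, the toy frame below is one-hot encoded. The values are invented for illustration (including the `paypal` level); only the column names mirror the dataset.

```python
import pandas as pd

# Toy frame with invented values; only the column names mirror the dataset
toy = pd.DataFrame({
    "Revenue": [54.0, 98.0, 69.0],
    "PaymentMethod": ["ideal", "paypal", "ideal"],
    "SubscriptionLifetimeBin": ["High", "Medium", "Low"],
})

# Each categorical column with k levels becomes k - 1 indicator columns;
# the first level in sorted order is dropped and acts as the baseline
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
```

Dropping one level per column avoids perfectly collinear indicator columns, which matters most for linear models such as logistic regression.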

## Handling imbalanced data
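The outcome classes are not equally represented: only 51 of the 945 customers are active. Many classification algorithms perform poorly on such skewed data, because they can reach a high accuracy by mostly predicting the majority class, so we need a technique for handling the imbalance.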

* **Up-sample Minority Class** - Up-sampling is the process of randomly duplicating observations from the minority class in order to reinforce its signal.
* **Down-sample Majority Class** - Down-sampling involves randomly removing observations from the majority class to prevent its signal from dominating the learning algorithm. We choose not to down-sample the majority class because the dataset is already relatively small.

[How to handle imbalanced data](https://elitedatascience.com/imbalanced-classes)

![Resampling](https://raw.githubusercontent.com/rafjaa/machine_learning_fecib/master/src/static/img/resampling.png)

In [6]:
# Check active vs inactive customers
data["Active"].value_counts()

0    894
1     51
Name: Active, dtype: int64

In [7]:
# Up-sample Minority Class
from sklearn.utils import resample

# Separate majority and minority classes
data_majority = data[data["Active"]==0]
data_minority = data[data["Active"]==1]
 
# Upsample minority class
data_minority_upsampled = resample(data_minority, 
                                   replace=True,     # sample with replacement
                                   n_samples=500,    # 500 draws; fewer than the majority class (894)
                                   random_state=123) # reproducible results
 
# Combine majority class with upsampled minority class
data_upsampled = pd.concat([data_majority, data_minority_upsampled])
 
# Display new class counts
data_upsampled["Active"].value_counts()

0    894
1    500
Name: Active, dtype: int64

As you can see, the new DataFrame has more observations than the original, and the ratio of the two classes is more balanced.

### Creating the Model
The final step in the processing phase is to separate the data into train and test datasets. We do this to check that our model performs well on data that was not used for training; a large gap between training and test performance is a sign of overfitting. We randomly select 80% of the data for the training dataset and use the remaining 20% as the test dataset.

In [8]:
# Split dataframe
data_x = data_upsampled[data_upsampled.columns.difference(["Active"])]
data_y = data_upsampled["Active"]

# Standardizing/scaling the features
from sklearn.preprocessing import StandardScaler
data_x = StandardScaler().fit_transform(data_x)

# Create Train & Test Data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data_x, data_y, test_size=0.2,random_state=123)


In [9]:
print("Length of X_train:",len(X_train))
print("Length of X_test:",len(X_test))
print("Length of y_train:",len(y_train))
print("Length of y_test:",len(y_test))

Length of X_train: 1115
Length of X_test: 279
Length of y_train: 1115
Length of y_test: 279
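Because the notebook up-samples before splitting, copies of the same minority row can end up in both the training and the test set, which tends to make test scores optimistic. The sketch below shows one possible alternative order (split, then up-sample and fit the scaler on the training rows only). It uses a synthetic stand-in frame with invented values, not the customer data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

# Synthetic stand-in for the customer data (values are invented)
rng = np.random.default_rng(123)
toy = pd.DataFrame({
    "Revenue": rng.normal(80, 20, size=200),
    "Transactions": rng.integers(1, 6, size=200),
    "Active": [1] * 20 + [0] * 180,
})

# 1) Split first, stratifying so both splits keep the class ratio
train, test = train_test_split(
    toy, test_size=0.2, stratify=toy["Active"], random_state=123
)

# 2) Up-sample the minority class inside the training split only
majority = train[train["Active"] == 0]
minority = train[train["Active"] == 1]
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=123
)
train_up = pd.concat([majority, minority_up])

# 3) Fit the scaler on training rows, then reuse it for the test rows
features = ["Revenue", "Transactions"]
scaler = StandardScaler().fit(train_up[features])
X_train = scaler.transform(train_up[features])
X_test = scaler.transform(test[features])
y_train, y_test = train_up["Active"], test["Active"]

# No test row index appears in the training data
print(set(train_up.index) & set(test.index))
```

With this order the test set keeps the original class ratio, so its scores are closer to what the model would see on new customers.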


Classification algorithms attempt to learn patterns in the data to be able to predict the membership of new data to one or more classes/groups. Since we are trying to predict active vs inactive customers (binary problem), there are a number of algorithms at our disposal. 

1. **Logistic regression** is a good choice for our problem.
2. Another good choice is **random forest**.
3. **Support Vector Machine (SVM)** is also a suitable option for this type of problem.
4. Last, **K-Nearest Neighbors** will be used to predict active vs inactive customers.

We will create all of the above mentioned models and evaluate their accuracy afterwards as follows: 

* One way to evaluate the model is using a **confusion matrix**. The confusion matrix specifies how many observations were correctly classified and how many were incorrectly classified. For this project, we will select the model that minimizes False Positives (falsely '
* Another widely used way to evaluate models is the **accuracy score**, the share of all observations that were classified correctly.
* Last, we will use the classification report to check precision, recall and the F1-score.
    * **Precision** is the share of observations predicted as a class that actually belong to that class.
    * **Recall** (Sensitivity) is the share of observations that actually belong to a class that the classifier finds.
    * **F1-score** is the harmonic mean of precision and recall.

Below we evaluate each model with these methods, using the predictions `lr_y_pred`, `rf_y_pred`, `svm_y_pred` and `knn_y_pred`.

![Confusion Matrix](../confusion_matrix.png)

In [10]:
# Import evaluation metrics
from sklearn.metrics import confusion_matrix # Confusion matrix
from sklearn.metrics import accuracy_score # Accuracy score
from sklearn.metrics import classification_report # Check precision, recall, f1-score

#### Logistic Regression
Logistic regression estimates the probability that an observation belongs to the positive class. The predicted class is obtained by applying a cutoff (default=0.5) to that probability.
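As a minimal sketch of what the cutoff means, the example below fits a logistic regression on synthetic data (not the customer data) and compares the default prediction with manual thresholds on the estimated probability. The 0.3 cutoff is an arbitrary value chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data, used here only to illustrate the cutoff
X, y = make_classification(n_samples=300, n_features=5, random_state=123)
model = LogisticRegression().fit(X, y)

# Column 1 of predict_proba is the estimated probability of class 1
proba_pos = model.predict_proba(X)[:, 1]

# The default prediction corresponds to a 0.5 cutoff on that probability
pred_default = (proba_pos > 0.5).astype(int)
print(np.array_equal(pred_default, model.predict(X)))

# A lower cutoff labels more observations as class 1
pred_low = (proba_pos > 0.3).astype(int)
print(pred_default.sum(), pred_low.sum())
```

Lowering the cutoff trades precision for recall on class 1, which can be useful when missing a positive case is the more costly error.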

In [11]:
# Import model
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Initiate model
lr = LogisticRegression()
lr_result = lr.fit(X_train, y_train)

# Predict response
lr_y_pred = lr.predict(X_test)

# Evaluate prediction
lr_matrix = confusion_matrix(y_test, lr_y_pred)
print("lr_matrix:\n",lr_matrix)

lr_score = accuracy_score(y_test, lr_y_pred)
print("\nlr_score:\n",lr_score)

lr_report = classification_report(y_test, lr_y_pred)
print("\nlr_report:\n",lr_report)


lr_matrix:
 [[169   6]
 [ 46  58]]

lr_score:
 0.8136200716845878

lr_report:
               precision    recall  f1-score   support

           0       0.79      0.97      0.87       175
           1       0.91      0.56      0.69       104

    accuracy                           0.81       279
   macro avg       0.85      0.76      0.78       279
weighted avg       0.83      0.81      0.80       279
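The class 1 row of the report and the accuracy score can be reproduced by hand from the confusion matrix printed above. The sketch below assumes scikit-learn's convention that rows are actual classes and columns are predicted classes.

```python
# Confusion matrix reported above for logistic regression:
# rows are actual classes (0, 1), columns are predicted classes (0, 1)
tn, fp = 169, 6
fn, tp = 46, 58

precision = tp / (tp + fp)   # share of predicted 1s that are truly 1
recall = tp / (tp + fn)      # share of actual 1s that were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
# -> 0.91 0.56 0.69 0.81
```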



In [12]:
## Let's manually check whether active vs inactive customers are predicted correctly
# First split the original (not up-sampled) dataframe
# NOTE: these features are unscaled, while lr was fit on standardized features,
# so the predictions below are likely unreliable
data_x = data[data.columns.difference(["Active"])]
data_y = data["Active"]

# Generate result (copy to avoid pandas' SettingWithCopyWarning)
result = data[["Active"]].copy()
result["Predicted"] = lr.predict(data_x)
result.sort_values(by="Active",ascending=False).head(10)
print("\nActual:\n",result["Active"].value_counts())
print("\nPredicted:\n",result["Predicted"].value_counts())


Actual:
 0    894
1     51
Name: Active, dtype: int64

Predicted:
 1    918
0     27
Name: Predicted, dtype: int64
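The predicted counts above differ sharply from the actual ones. One likely reason is that the model was fit on standardized features (the `StandardScaler` step) while this check predicts on the raw columns of `data`. The sketch below illustrates the mismatch on synthetic data, and shows how a scikit-learn `Pipeline` keeps the fitted scaler attached to the model. The offset and scale applied to `X_raw` are invented so that scaling matters.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with a large offset and scale so that scaling matters
X, y = make_classification(n_samples=300, n_features=5, random_state=123)
X_raw = X * 50 + 200

# Variant A: scale by hand, then predict on raw and on scaled features
scaler = StandardScaler().fit(X_raw)
model = LogisticRegression().fit(scaler.transform(X_raw), y)
pred_unscaled = model.predict(X_raw)                    # mismatch: raw input
pred_scaled = model.predict(scaler.transform(X_raw))    # matches training

# Variant B: a pipeline applies the fitted scaler automatically
pipe = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_raw, y)
pred_pipe = pipe.predict(X_raw)

print("agreement, raw vs scaled input:", np.mean(pred_unscaled == pred_scaled))
print("agreement, pipeline vs scaled input:", np.mean(pred_pipe == pred_scaled))
```

Keeping the scaler object (or using a pipeline) makes it possible to transform the original dataframe in the same way as the training data before calling `predict`.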




#### Random Forest
Random forest is an ensemble algorithm. This means that it resamples the data multiple times and generates a decision tree from each sample. We then think of the trees as a group of algorithms. We base our decision on the outcome of the majority of algorithms in the ensemble.

Decision trees often perform well on imbalanced datasets because their hierarchical structure allows them to learn signals from both classes.
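The resample-and-vote idea can be sketched by hand with individual decision trees. This is an illustration on synthetic data, not how `RandomForestClassifier` is implemented internally; scikit-learn's forest averages the trees' predicted probabilities rather than counting hard votes. The choice of 25 trees is arbitrary (an odd number avoids ties).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

X, y = make_classification(n_samples=400, n_features=8, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=123
)

# Fit each tree on a bootstrap sample (rows drawn with replacement)
trees = []
for seed in range(25):
    X_boot, y_boot = resample(X_train, y_train, replace=True, random_state=seed)
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=seed)
    trees.append(tree.fit(X_boot, y_boot))

# Collect one vote per tree and take the majority class per test row
votes = np.array([tree.predict(X_test) for tree in trees])  # (n_trees, n_rows)
majority_vote = (votes.mean(axis=0) > 0.5).astype(int)

print("ensemble accuracy:", np.mean(majority_vote == y_test))
```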

In [13]:
# Import model
from sklearn.ensemble import RandomForestClassifier

# Initiate model
rf = RandomForestClassifier()
rf_result = rf.fit(X_train, y_train)

# Predict response
rf_y_pred = rf_result.predict(X_test)
rf_y_pred_train = rf_result.predict(X_train)

# Evaluate prediction
rf_matrix = confusion_matrix(y_test, rf_y_pred)
print("rf_matrix:\n",rf_matrix)

rf_score = accuracy_score(y_test, rf_y_pred)
print("\nrf_score:\n",rf_score)

rf_score_train = accuracy_score(y_train, rf_y_pred_train)
print("\nrf_score_train:\n",rf_score_train)

rf_report = classification_report(y_test, rf_y_pred)
print("\nrf_report:\n",rf_report)

rf_matrix:
 [[158  17]
 [  6  98]]

rf_score:
 0.9175627240143369

rf_score_train:
 0.9542600896860987

rf_report:
               precision    recall  f1-score   support

           0       0.96      0.90      0.93       175
           1       0.85      0.94      0.89       104

    accuracy                           0.92       279
   macro avg       0.91      0.92      0.91       279
weighted avg       0.92      0.92      0.92       279



#### Support Vector Machine (SVM)
SVM looks at the datapoints that are closest to the line that divides the two classes. We will use a penalized SVM to increase the cost of classification mistakes on the minority class.
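The penalty comes from `class_weight='balanced'`, which weights each class inversely to its frequency. As a small worked example, the sketch below computes those weights for the class counts of the up-sampled data (894 and 500). The weights actually used depend on the labels passed to `fit`, so these numbers are illustrative.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts of the up-sampled data (894 inactive, 500 active);
# the actual weights depend on the labels passed to fit()
y = np.array([0] * 894 + [1] * 500)

# 'balanced' weight for class c: n_samples / (n_classes * count(c))
manual = len(y) / (2 * np.bincount(y))
auto = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)

print(manual.round(3), auto.round(3))
```

The rarer class receives the larger weight, so misclassifying one of its observations costs more during training.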

In [14]:
# Import model
from sklearn.svm import SVC

# Initiate model
svm = SVC(kernel='linear',
          class_weight='balanced', # penalize
          probability=True)
svm_result = svm.fit(X_train, y_train)

# Predict response
svm_y_pred = svm_result.predict(X_test)

# Evaluate prediction
svm_matrix = confusion_matrix(y_test, svm_y_pred)
print("svm_matrix:\n",svm_matrix)

svm_score = accuracy_score(y_test, svm_y_pred)
print("\nsvm_score:\n",svm_score)

svm_report = classification_report(y_test, svm_y_pred)
print("\nsvm_report:\n",svm_report)


svm_matrix:
 [[159  16]
 [ 46  58]]

svm_score:
 0.7777777777777778

svm_report:
               precision    recall  f1-score   support

           0       0.78      0.91      0.84       175
           1       0.78      0.56      0.65       104

    accuracy                           0.78       279
   macro avg       0.78      0.73      0.74       279
weighted avg       0.78      0.78      0.77       279



In [15]:
## Let's manually check whether active vs inactive customers are predicted correctly
# First split the original (not up-sampled) dataframe
# NOTE: these features are unscaled, while svm was fit on standardized features,
# so the predictions below are likely unreliable
data_x = data[data.columns.difference(["Active"])]
data_y = data["Active"]

# Generate result (copy to avoid pandas' SettingWithCopyWarning)
result = data[["Active"]].copy()
result["Predicted"] = svm.predict(data_x)
result.sort_values(by="Active",ascending=False).head(10)
print("\nActual:\n",result["Active"].value_counts())
print("\nPredicted:\n",result["Predicted"].value_counts())


Actual:
 0    894
1     51
Name: Active, dtype: int64

Predicted:
 0    925
1     20
Name: Predicted, dtype: int64




#### K-Nearest Neighbors (KNN)
KNN assigns each test point the majority class among its k nearest training points, where k is the number of neighbors considered.
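A minimal sketch of the idea in plain NumPy, using Euclidean distance and a majority vote. The helper `knn_predict` and the toy coordinates are hypothetical and only illustrate the principle; the notebook itself uses `KNeighborsClassifier`, whose default is k = 5.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=5):
    """Predict the class of x_new by majority vote among its k nearest
    training points (Euclidean distance)."""
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(distances)[:k]          # indices of the k closest points
    votes = np.bincount(y_train[nearest])        # votes per class label
    return int(np.argmax(votes))

# Toy training set: two clusters with invented coordinates
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([0.15, 0.1]), k=3))   # -> 0
print(knn_predict(X_train, y_train, np.array([0.95, 1.0]), k=3))   # -> 1
```

Because the method relies on distances, features on very different scales should be standardized first, as done earlier in the notebook.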

In [16]:
# Import model
from sklearn.neighbors import KNeighborsClassifier

# Initiate model
knn = KNeighborsClassifier()
knn_result = knn.fit(X_train, y_train)

# Predict response
knn_y_pred = knn_result.predict(X_test)

# Evaluate prediction
knn_matrix = confusion_matrix(y_test, knn_y_pred)
print("knn_matrix:\n",knn_matrix)

knn_score = accuracy_score(y_test, knn_y_pred)
print("\nknn_score:\n",knn_score)

knn_report = classification_report(y_test, knn_y_pred)
print("\nknn_report:\n",knn_report)

knn_matrix:
 [[151  24]
 [ 14  90]]

knn_score:
 0.8637992831541219

knn_report:
               precision    recall  f1-score   support

           0       0.92      0.86      0.89       175
           1       0.79      0.87      0.83       104

    accuracy                           0.86       279
   macro avg       0.85      0.86      0.86       279
weighted avg       0.87      0.86      0.86       279



In [17]:
## Let's manually check whether active vs inactive customers are predicted correctly
# First split the original (not up-sampled) dataframe
# NOTE: these features are unscaled, while knn was fit on standardized features,
# so the predictions below are likely unreliable
data_x = data[data.columns.difference(["Active"])]
data_y = data["Active"]

# Generate result (copy to avoid pandas' SettingWithCopyWarning)
result = data[["Active"]].copy()
result["Predicted"] = knn.predict(data_x)
result.sort_values(by="Active",ascending=False).head(10)
print("\nActual:\n",result["Active"].value_counts())
print("\nPredicted:\n",result["Predicted"].value_counts())


Actual:
 0    894
1     51
Name: Active, dtype: int64

Predicted:
 1    945
Name: Predicted, dtype: int64




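Of all the models, the **RandomForest** produces the fewest False Negatives on the test set (6, compared with 14 for KNN and 46 for both logistic regression and SVM). With `Active = 1` as the positive class, a False Negative is a customer who was predicted to be inactive but is actually active. The opposite error, predicting that a customer is active when they have actually unsubscribed (become inactive), is a False Positive; it is lowest for logistic regression (6, compared with 17 for the random forest).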

Based on the accuracy scores of the test and train data (0.92 vs. 0.95 respectively), the gap between the two is small, so there is no strong sign of overfitting. These scores should still be read with caution, because we have few rows to train on and only 51 distinct active customers. To start making valid predictions, it is necessary to get more training data. Ideally there would be more data and a more balanced mix of active and inactive customers in the complete dataset.

Still, let's work with what we have and calculate the class probabilities of the features for that model.

In [18]:
# Split dataframe
data_x = data_upsampled[data_upsampled.columns.difference(["Active"])]
data_y = data_upsampled["Active"]

# Standardizing/scaling the features
from sklearn.preprocessing import StandardScaler
data_x = StandardScaler().fit_transform(data_x)

# Create Train & Test Data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data_x, data_y, test_size=0.3) # Note that here we increased the test_size with +0.1

In [19]:
# Initiate model
rf = RandomForestClassifier()
rf_result = rf.fit(X_train, y_train)

# Predict response
rf_y_pred = rf_result.predict(X_test)
rf_y_pred_train = rf_result.predict(X_train)

## Let's manually check whether active vs inactive customers are predicted correctly
# First split the original (not up-sampled) dataframe
# NOTE: these features are unscaled, while rf was fit on standardized features,
# so the predictions below are likely unreliable
data_x = data[data.columns.difference(["Active"])]
data_y = data["Active"]

# Generate result on original dataframe (copy to avoid SettingWithCopyWarning)
result = data[["Active"]].copy()
result["Predicted"] = rf.predict(data_x)
result.sort_values(by="Active",ascending=False).head(10)
print("\nActual:\n",result["Active"].value_counts())
print("\nPredicted:\n",result["Predicted"].value_counts())
result.sort_values(by="Active",ascending=False)


Actual:
 0    894
1     51
Name: Active, dtype: int64

Predicted:
 0    945
Name: Predicted, dtype: int64




Unnamed: 0,Active,Predicted
606,1,0
433,1,0
338,1,0
791,1,0
282,1,0
...,...,...
324,0,0
325,0,0
327,0,0
328,0,0


In [20]:
# First split the original dataframe
data_x = data[data.columns.difference(["Active"])]
data_y = data["Active"]

In [21]:
# Calculate probability of outcome
pred_proba = rf.predict_proba(data_x)

# Create dataframe with probability outcomes
# NOTE: predict_proba orders its columns by rf.classes_ (0, then 1), so column 0
# is P(Active == 0); the names assigned below appear to be swapped
final_result = pd.DataFrame(pred_proba)
final_result = final_result.rename(columns={0:"Prob_Active",
                                            1:"Prob_Inactive"})
final_result["Actual"] = data["Active"]
final_result["Predicted"] = rf.predict(data_x)

In [22]:
final_result.columns

Index(['Prob_Active', 'Prob_Inactive', 'Actual', 'Predicted'], dtype='object')

In [23]:
# Change order of the columns
final_result = final_result[["Actual","Predicted","Prob_Active","Prob_Inactive"]]
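When naming probability columns, it helps to derive the names from the fitted model's `classes_` attribute, since `predict_proba` returns one column per class in that order. The sketch below shows this on synthetic data; the mapping of 0 to "Inactive" and 1 to "Active" is an assumption that mirrors the coding of the `Active` column.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=123)
clf = RandomForestClassifier(n_estimators=50, random_state=123).fit(X, y)

# Columns of predict_proba follow clf.classes_, here [0, 1]
print(clf.classes_)

# Name each probability column after the class it belongs to
# (assumed mapping: 0 -> inactive, 1 -> active, as in the Active column)
names = {0: "Prob_Inactive", 1: "Prob_Active"}
proba = pd.DataFrame(
    clf.predict_proba(X), columns=[names[c] for c in clf.classes_]
)
proba["Predicted"] = clf.predict(X)

# Show the first rows: a predicted class of 1 goes with Prob_Active > 0.5
print(proba.head())
```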

Even though a larger and more balanced dataset would work better, let's divide all customers into three customer segments so that we can give practical ideas to improve the number of active customers in the long term.


In [24]:
labels = ["Low","Medium","High"]
cutoff = [0,0.5,0.8,1]
final_result["ProbabilityInactive"] = pd.cut(final_result["Prob_Inactive"],cutoff,labels=labels)
final_result.sort_values(by="Prob_Inactive",ascending=False).head(5)

Unnamed: 0,Actual,Predicted,Prob_Active,Prob_Inactive,ProbabilityInactive
419,1,0,0.61,0.39,Low
272,0,0,0.61,0.39,Low
273,0,0,0.61,0.39,Low
27,1,0,0.61,0.39,Low
274,0,0,0.61,0.39,Low


In [25]:
final_result["ProbabilityInactive"].value_counts()

Low       945
High        0
Medium      0
Name: ProbabilityInactive, dtype: int64
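To make the bin edges explicit, the sketch below applies the same `labels` and `cutoff` to a few hand-picked probabilities (invented values). By default `pd.cut` uses right-inclusive intervals, so a probability of exactly 0 falls outside the first bin unless `include_lowest=True` is passed.

```python
import pandas as pd

labels = ["Low", "Medium", "High"]
cutoff = [0, 0.5, 0.8, 1]
probs = pd.Series([0.0, 0.39, 0.5, 0.61, 0.8, 0.95])

# Default bins are right-inclusive: (0, 0.5], (0.5, 0.8], (0.8, 1]
default = pd.cut(probs, cutoff, labels=labels)

# include_lowest=True also places an exact 0.0 in the first bin
inclusive = pd.cut(probs, cutoff, labels=labels, include_lowest=True)

print(pd.DataFrame({"p": probs, "default": default, "inclusive": inclusive}))
```

With the default settings the first value (0.0) is left unlabelled (NaN), while 0.5 and 0.8 fall into Low and Medium respectively.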

In [26]:
# Now that we have the result let's save it as a new csv
final_result.to_csv("../data/data_result.csv",index=False)

### Next steps

We've modeled the risk of inactive customers and defined three customer segments based on `Prob_Inactive` (p), using the cutoffs from the code above:

1. Low - low risk of being an inactive customer (p <= 0.5)
2. Medium - medium risk of being an inactive customer (0.5 < p <= 0.8)
3. High - high risk of being an inactive customer (p > 0.8)

In this run, all 945 customers fall into the Low segment.

This could feed into a dashboard to give stakeholders a glimpse of "at-risk" customers. It also provides three different groups for which we can run specific actions. Some ideas:

1) Reach out to unsubscribed/inactive customers to figure out why they left.

2) Send different types of targeted emails and special offers to the high risk group. If the sample size of high risk customers is large enough, you could split off a few small treatment groups and compare how their retention and CLV change with different promotional or customer relationship strategies.

3) Determine the highest-value customers in the active customer group, and serve them additional benefits to ensure that they remain loyal customers.