# Problem Statement

The company is concerned about an increase in customer churn that could lead to a significant loss of revenue and market share. Failing to predict and prevent this customer turnover effectively will harm the company, both financially and reputationally. 

A prediction model must be developed to uncover the core causes of this churn and to devise mitigation solutions before clients leave the company. The model should also predict whether future clients will leave the company or stay, based on the customer dataset 'A2.csv'.

It is important to understand why clients churn and which factors contribute to it.

Churn can depend on various factors, and identifying them is necessary in order to minimize churn in the future. Each factor is then analyzed to conclude whether its impact is positive or negative.

The trends identified can be used by upper management to make informed decisions on how best to avoid customer churn in the future.


# Results
During the design of the predictive model, it was discovered that Total charges (monthlycharges*tenure), estimated salary, month to month contract, no online security, no tech support, DSL internet service, no device protection and no online backup have the most predictive power in that order.

However, eliminating the other columns (has dependents, has no multiple lines, non-automatic payment, is senior citizen) lowers the predictive power of the model. This indicates that, even though they are not the most important contributors to churn, they do carry some weight.

Random Forest and Balanced Random Forest were the best-performing models in the first iteration of the different models.

In the second iteration, because the class imbalance in the training/testing split of the dataset was treated before running the different models, the Random Forest prediction model performed better and is therefore proposed to be used by upper management in their decision making.

The Random Forest Predictive Model yielded the following results:

- The results present the performance metrics of a binary classification model, specifically a Random Forest model, on a dataset of 2953 instances. The model predicts one of two Churn classes, labeled 0 if the customer did not leave the company and 1 if the customer did leave the company.

- Precision is the fraction of true positives (TP) out of all the positive predictions (TP + false positives (FP)). In this case, precision for ‘No Churn’  is 0.87, meaning that out of all the instances that the model predicted as class ‘No Churn’, 87% of them were, in fact, customers that didn't leave the company. Precision for ‘Churn’ is 0.89, meaning that out of all the instances that the model predicted as customers that churned, 89% of them were actually ‘Churn’.


- Recall is the proportion of true positives (TP) out of all the actual positives (TP + false negatives (FN)). In this case, recall for ‘No Churn’ is 0.90, indicating that out of all the instances that were actually ‘No Churn’, 90% of them were correctly detected by the model. Recall for ‘Churn’ is 0.86, meaning that out of all the instances that were ‘Churn’, 86% of them were correctly classified by the model.

- A single metric that balances precision and recall is the F1-score, which is the harmonic mean of the two. The F1-scores for ‘No Churn’ and ‘Churn’ are both 0.88.

- Accuracy is the percentage of instances (TP + TN) that were correctly predicted out of all instances. In this instance, the model's overall accuracy was 0.88, which indicates that 88% of the cases in the dataset were properly identified.
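
The definitions above can be checked by hand. The following is a minimal sketch that derives the four metrics for one class from confusion-matrix counts. The function name `metrics_from_counts` and the counts are hypothetical; they are not taken from the model's output.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Return precision, recall, F1 and accuracy for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Hypothetical counts, for illustration only
example = metrics_from_counts(tp=80, fp=20, fn=10, tn=90)
print({name: round(value, 2) for name, value in example.items()})
```

To obtain the metrics for the other class, swap the roles of the two classes (TP with TN and FP with FN) and call the function again.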



# Assumptions made

- The data provided is accurate.
- Tenure is measured in months.
- The current month’s charge is not included in the total charges.
- There are outliers in the data; for example, estimated salary has a value of 11.58.
- Senior citizens are assumed to be people over the age of 65.
- The values for charges include extra services and taxes, which are fixed for each month.
- Estimated Salary has been calculated based on feedback forms completed by customers.


# Limitations

- The model cannot provide a best practice to prevent customer churn; that is a business decision to be made by a different team within the company.
- There is no supporting data describing the potential causes of churn.
- The trends observed during the analysis are isolated and cannot be used by themselves to explain the churn rate. 
- There is a lack of data on how the added services are sold, as well as pricing policies. If the company were to make the marketing data available to this team the predictive model could be improved and expanded to contemplate the causes for churn in detail.


# Data

The dataset 'A2.csv' was provided by the company because it contains customer data which will aid in the analysis of the churn.

# EDA: Variables & Description
### Continuous

- `Tenure` : Time as the customer of the company, expressed in months.
- `Total Charges` : Total amount paid by the end of tenure.
- `Credit Score` : Credit score for each customer (No source).
- `Estimated Salary` : Estimated base salary for each customer.
- `Monthly Charges` : Per month charges calculated for each customer per service.
- `Charge` : Expenses calculated for the current month for each customer.

### Categorical

- `Gender` : Identifies Female = 0 and Male = 1
- `Senior Citizen` : Age over 65; identifies Yes = 1 and No = 0.
- `Partner` : Cohabitant; identifies Yes = 1 and No = 0.
- `Phone Service` : Phone service included in contract; identifies Yes = 1 and No = 0.
- `Multiple Lines` : More than 1 phone line included in the contract.
- `Internet Services` : Internet service included in contract.
- `Online Security` : Online Security included in contract.
- `Online Backup` : Online Backup included in contract.
- `Device Protection` : Device Protection included in contract.
- `Tech Support` : Tech Support included in contract.
- `Streaming TV` : Streaming TV service included in contract.
- `Streaming Movies` : Streaming Movies service included in contract.
- `Contract` : Duration of the contract.
- `Paperless Billing` : Electronic billing activated.
- `Churn` : Customers stopped subscribing to the services.
- `Geography` : Countries where the customers are located.
- `Dependents` : Customers have people depending on them.
- `Payment Method` : Customer's payment mode.


### Personal Identification Information (PII)
- `Surname`
- `Customer ID`


If the packages have never been installed, they can be installed from the terminal with the following commands:

In [None]:
# pip install xgboost
# pip install imbalanced-learn

In [None]:
# Import all the relevant libraries for this analysis
import io
import warnings

import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Modules for preprocessing and model selection
from sklearn import ensemble, linear_model, naive_bayes, neighbors
from sklearn.impute import KNNImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from imblearn.over_sampling import SMOTE
from imblearn.ensemble import BalancedRandomForestClassifier
from imblearn.ensemble import BalancedBaggingClassifier
from imblearn.ensemble import EasyEnsembleClassifier

# sklearn modules for model evaluation and improvement
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.metrics import make_scorer, recall_score, log_loss
from sklearn.metrics import average_precision_score

warnings.filterwarnings('ignore')

In [None]:
# loading the dataset
data = pd.read_csv('A2.csv')

In [None]:
# Preliminary exploration of the data
data.head()

In [None]:
# Print the name of the columns
for col in data.columns:
    print(col)

In [None]:
# Calculate the percentage of missing values per column
for i in data.columns:
    n_miss = data[i].isna().sum()
    perc = round(n_miss / len(data) * 100, 2)
    print(f'Missing values for \033[1m{i}\033[0m is {perc}% ')

In [None]:
data.isna().any()

After understanding how many values were empty in the dataset, a plan was devised to treat them. 

Dropping all null values was not a possibility since the total number of observations that would have been dropped represented nearly 25% of the dataset. 

Therefore, null values were filled with the mode, the most repeated value, in the categorical variables.

Null values present in continuous variables were filled with the median so that the results wouldn't be skewed in any way by the treatment process.
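
The mode/median plan described above can be sketched as follows. This is a minimal illustration on a small made-up frame, assuming that every non-numeric column is categorical; `fill_missing` is a hypothetical helper and not part of the notebook. Numerically coded categories such as `SeniorCitizen` would need to be listed explicitly instead.

```python
import numpy as np
import pandas as pd


def fill_missing(df):
    """Fill numeric NaNs with the median and all other NaNs with the mode."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode()[0])
    return out


# Made-up data, for illustration only
toy = pd.DataFrame({
    "Contract": ["Month-to-month", None, "Month-to-month", "Two year"],
    "tenure": [1.0, 10.0, np.nan, 100.0],
})
filled = fill_missing(toy)
print(filled)
```

The median is used for the numeric column because it is not pulled towards extreme values such as the 100 in this toy example.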


In [None]:
# Finding the unique values in the columns
data["PaymentMethod"].unique()

In [None]:
data["Contract"].nunique()

In [None]:
data["Contract"].unique()

In [None]:
data["Churn"].value_counts()

K-nearest-neighbors imputation (`KNNImputer`) was used to fill the NA values in the columns Senior Citizen, Tenure, Credit Score, Estimated Salary, Monthly Charges and Charge.

In [None]:
# Imputation to deal with missing values
data_list = ['SeniorCitizen','tenure', 'CreditScore', 'EstimatedSalary', 'MonthlyCharges', 'Charge' ]
data_nv = data[data_list]
impKNN = KNNImputer(n_neighbors=10)
newval = impKNN.fit_transform(data_nv)
data2 = pd.DataFrame(newval, columns=data_list, index = data_nv.index)
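
To illustrate what `KNNImputer` does, the sketch below runs it on a tiny made-up array (not the customer data) with `n_neighbors=2`. Each missing entry is replaced by the mean of that feature over the two nearest rows that have it, with distances computed on the features both rows share.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up values, for illustration only
toy = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

imputer = KNNImputer(n_neighbors=2)
toy_filled = imputer.fit_transform(toy)
print(toy_filled)
```

In this toy array the missing value in the first row becomes 4.0 (the mean of 3 and 5) and the missing value in the third row becomes 5.5 (the mean of 3 and 8).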

In [None]:
# Confirm that there are no missing values left
for i in data2.columns:
    n_miss = data2[i].isna().sum()
    perc = round(n_miss / len(data2) * 100, 2)
    print(f'Missing values for \033[1m{i}\033[0m is {perc}% ')

In [None]:
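# Fill NaNs in every column with that column's mode (most frequent value).
# Assigning the result back avoids chained in-place modification of a column.
data3 = data.copy()
for column in data3.columns:
    data3[column] = data3[column].fillna(data3[column].mode()[0])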

In [None]:
data3['PhoneService'].value_counts()

In [None]:
# Confirm the treatment of NA values
for i in data3.columns:
    n_miss = data3[i].isna().sum()
    perc = round(n_miss / len(data3) * 100, 2)
    print(f'Missing values for \033[1m{i}\033[0m is {perc}% ')

In [None]:
# After treating NA values, merge the two datasets back into one.
# Columns present in both get the suffix _x (from data2, KNN-imputed)
# or _y (from data3, mode-filled); the _y versions are the ones selected here.
data_fin = pd.merge(data2, data3, left_index=True, right_index=True)
data_fin = data_fin[[
    'gender', 'SeniorCitizen_y', 'Partner', 'tenure_y', 'PhoneService',
    'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup',
    'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies',
    'Contract', 'PaperlessBilling', 'TotalCharges', 'Churn', 'Geography',
    'CreditScore_y', 'EstimatedSalary_y', 'MonthlyCharges_y', 'customerID',
    'Dependents', 'PaymentMethod', 'Charge_y'
]]
data_fin.head()

In [None]:
# Confirm the columns inside dataset
for col in data_fin.columns:
  print(col)

# Problem Solving

- Problem solving began with data preparation, in which the data was cleaned to identify and handle errors, incorrect values, and missing values.
- Missing values were replaced by the mode or median of each variable so that the data could be interpreted and used in calculations.
- Categorical values were transformed into boolean (0/1) variables.
- The cleaned data was then analyzed and interpreted to understand trends in the dataset.
- To decide which variables to retain, the proportion of churn was calculated and plotted for each variable.
- In every iteration, each classification model was run to generate results.
- The best-performing model was selected as the final predictive model.


In [None]:
# Rename columns for readability
data_fin.rename(columns={
    'SeniorCitizen_y': 'SeniorCitizen',
    'tenure_y': 'tenure',
    'CreditScore_y': 'CreditScore',
    'EstimatedSalary_y': 'EstimatedSalary',
    'MonthlyCharges_y': 'MonthlyCharges',
    'Charge_y': 'Charge'
},
                inplace=True)

In [None]:
# Convert Senior Citizen column to integer
data_fin['SeniorCitizen'] = data_fin['SeniorCitizen'].astype(int)
data_fin

In [None]:
# Plot the counts of values in selected columns
# (the column is passed as x=..., which recent seaborn versions require)
fig, axes = plt.subplots(2, 3, figsize=(12, 7), sharey=True)
sns.countplot(x="gender", data=data_fin, ax=axes[0,0])
sns.countplot(x="SeniorCitizen", data=data_fin, ax=axes[0,1])
sns.countplot(x="Partner", data=data_fin, ax=axes[0,2])
sns.countplot(x="Dependents", data=data_fin, ax=axes[1,0])
sns.countplot(x="PhoneService", data=data_fin, ax=axes[1,1])
sns.countplot(x="PaperlessBilling", data=data_fin, ax=axes[1,2])

In [None]:
# Encode 'Churn' as a 0/1 variable (Yes = 1, No = 0)
data_fin_0 = data_fin.copy()
churn_numeric = {'Yes':1, 'No':0}
data_fin_0['Churn'] = data_fin_0['Churn'].replace(churn_numeric).astype(int)

Tables were created to see the proportion of customers that churned for each variable. 

Whenever the proportions were very similar for 'yes' or 'no' churn, a note was made to possibly drop the columns during the model creation process. 
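
The comparison described above can also be summarized in one table. The sketch below is a hedged illustration on made-up data: `churn_rate_spread` is a hypothetical helper that reports, for each column, the gap between the highest and lowest churn rate across its categories. A gap near zero suggests that the column barely separates churners from non-churners.

```python
import pandas as pd


def churn_rate_spread(df, columns, target="Churn"):
    """Return max - min of the mean target across each column's categories."""
    spreads = {}
    for col in columns:
        rates = df.groupby(col)[target].mean()
        spreads[col] = rates.max() - rates.min()
    return pd.Series(spreads).sort_values()


# Made-up data, for illustration only
toy = pd.DataFrame({
    "gender": ["F", "M", "F", "M"],
    "Contract": ["Monthly", "Monthly", "Yearly", "Yearly"],
    "Churn": [1, 1, 0, 0],
})
spreads = churn_rate_spread(toy, ["gender", "Contract"])
print(spreads)
```

In this toy frame both genders churn at the same rate, so `gender` has a spread of 0, while `Contract` separates the two outcomes completely and has a spread of 1.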

In [None]:
# Proportion of Churn by Gender
data_fin_0[['gender','Churn']].groupby(['gender']).mean()

In [None]:
# Proportion of Churn by Multiple Line
data_fin_0[['MultipleLines','Churn']].groupby(['MultipleLines']).mean()

In [None]:
# Proportion of Churn by Senior Citizen
data_fin_0[['SeniorCitizen','Churn']].groupby(['SeniorCitizen']).mean()

In [None]:
# Proportion of Churn by Partner
data_fin_0[['Partner','Churn']].groupby(['Partner']).mean()

In [None]:
# Proportion of Churn by Phone Service
data_fin_0[['PhoneService','Churn']].groupby(['PhoneService']).mean()

In [None]:
# Proportion of Churn by Online Security
data_fin_0[['OnlineSecurity','Churn']].groupby(['OnlineSecurity']).mean()

In [None]:
# Proportion of Churn by Online Backup
data_fin_0[['OnlineBackup','Churn']].groupby(['OnlineBackup']).mean()

In [None]:
# Proportion of Churn by Device Protection
data_fin_0[['DeviceProtection','Churn']].groupby(['DeviceProtection']).mean()

In [None]:
# Proportion of Churn by Tech Support
data_fin_0[['TechSupport','Churn']].groupby(['TechSupport']).mean()

In [None]:
# Proportion of Churn by StreamingTv
data_fin_0[['StreamingTV','Churn']].groupby(['StreamingTV']).mean()

In [None]:
# Proportion of Churn by Streaming Movies
data_fin_0[['StreamingMovies','Churn']].groupby(['StreamingMovies']).mean()

In [None]:
# Proportion of Churn by Contract
data_fin_0[['Contract','Churn']].groupby(['Contract']).mean()

In [None]:
# Proportion of Churn by PaperlessBilling
data_fin_0[['PaperlessBilling','Churn']].groupby(['PaperlessBilling']).mean()

In [None]:
# Proportion of Churn by Geography
data_fin_0[['Geography','Churn']].groupby(['Geography']).mean()

In [None]:
# Proportion of Churn by Dependents
data_fin_0[['Dependents','Churn']].groupby(['Dependents']).mean()

In [None]:
# Proportion of Churn by Payment Method
data_fin_0[['PaymentMethod','Churn']].groupby(['PaymentMethod']).mean()

In [None]:
# Mean Tenure, Monthly Charges, Credit Score, & Estimated Salary by Churn
data_fin_0[['tenure','MonthlyCharges','CreditScore','EstimatedSalary','Churn']].groupby('Churn').mean()

In [None]:

# Visualization of Churn by Senior Citizen (bars show customer counts)
%matplotlib inline
pd.crosstab(data_fin_0.SeniorCitizen,data_fin_0.Churn).plot(kind='bar')
plt.title('Churn by Senior Citizen')
plt.xlabel('Senior Citizen')
plt.ylabel('Number of customers')
plt.savefig('churn_sencit')

In [None]:
# Create a new dataframe with the variables that were kept for the model
data_fin_1 = data_fin_0[[
    'SeniorCitizen', 'Partner', 'tenure', 'MultipleLines', 'InternetService',
    'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
    'Contract', 'EstimatedSalary', 'MonthlyCharges', 'Dependents',
    'PaymentMethod', 'Churn'
]].copy()

In [None]:
# Confirmation of new dataframe
for col in data_fin_1:
  print(data_fin_1[col].unique())

In [None]:
# Split between X variables and y variables
x_var = [
    'SeniorCitizen', 'Partner', 'tenure', 'MultipleLines', 'InternetService',
    'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
    'Contract', 'EstimatedSalary', 'MonthlyCharges', 'Dependents',
    'PaymentMethod'
]

In [None]:
y_data = data_fin_1.loc[ : , 'Churn'] # y is always churn
x_data = data_fin_1.loc[ : , x_var]

In [None]:
s = (x_data.dtypes == 'object')
object_cols = list(s[s].index)

In [None]:
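# One-hot encode the categorical columns: 1 if the category is present, 0 if not
# (`sparse_output` requires scikit-learn 1.2 or later; older versions use `sparse=False`)
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

OH_cols = pd.DataFrame(OH_encoder.fit_transform(x_data[object_cols]))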

In [None]:
OH_cols.index = x_data.index

In [None]:
# Label the encoded columns with their category names
# (the `inplace` argument of set_axis was removed in pandas 2.0)
OH_cols.columns = np.concatenate(OH_encoder.categories_)

In [None]:
num_X = x_data.drop(object_cols, axis=1)

In [None]:
OH_X = pd.concat([num_X, OH_cols], axis=1)

In [None]:
OH_X.info()

In [None]:
col_labels = OH_X.columns.values.tolist()

In [None]:
# Name the columns
col_labels[0] = "is senior citizen"
col_labels[1] = "tenure"
col_labels[2] = "estimated salary"
col_labels[3] = "monthly charges"
col_labels[4] = "has no partner"
col_labels[5] = "has partner"
col_labels[6] = "has no multiple lines"
col_labels[7] = "no phone service (multiple lines)"
col_labels[8] = "has multiple lines"
col_labels[9] = "DSL internet service"
col_labels[10] = "fiber optic internet service"
col_labels[11] = "no internet service (internet service)"
col_labels[12] = "no online security"
col_labels[13] = "no internet service (online security)"
col_labels[14] = "has online security"
col_labels[15] = "no online backup"
col_labels[16] = "no internet service (online backup)"
col_labels[17] = "has online backup"
col_labels[18] = "no device protection"
col_labels[19] = "no internet service (device protection)"
col_labels[20] = "has device protection"
col_labels[21] = "no tech support"
col_labels[22] = "no internet service (tech support)"
col_labels[23] = "has tech support"
col_labels[24] = "month-to-month contract"
col_labels[25] = "one-year contract"
col_labels[26] = "two-year contract"
col_labels[27] = "no dependents"
col_labels[28] = "has dependents"
col_labels[29] = "bank transfer (automatic)"
col_labels[30] = "credit card (automatic)"
col_labels[31] = "electronic check"
col_labels[32] = "mailed check"

OH_X.columns = col_labels

In [None]:
# Confirm name changes to columns
full_data = pd.merge(OH_X, y_data, left_index = True, right_index = True)
full_data.head()

In [None]:
# Heatmap of the correlation between the variables.
# Only coefficients of at least 0.5 or at most -0.5 are shown,
# to help decide which variables to drop.
def get_heatmap(dataframe):

  corr = dataframe.corr()
  plt.figure(figsize=(20, 15))

  # Generate a mask for the upper triangle
  mask = np.triu(np.ones_like(corr, dtype=bool))

  # Only show the strong correlations
  sns.heatmap(corr[(corr >= 0.5) | (corr <= -0.5)],
              cmap='viridis',
              mask=mask,
              vmax=1.0,
              vmin=-1.0,
              linewidths=0.1,
              annot=True,
              annot_kws={"size": 8},
              square=True)
  
get_heatmap(full_data)

In [None]:
# Calculate the Variance Inflation Factor (VIF); features with high values are dropped
def compute_vif(considered_features):

    # work on a copy so that full_data itself is not modified
    X = full_data[considered_features].copy()
    # the calculation of variance inflation requires a constant
    X['intercept'] = 1

    # create dataframe to store vif values
    vif = pd.DataFrame()
    vif["Variable"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif = vif[vif['Variable']!='intercept']
    return vif
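
For reference, the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the other features plus an intercept. The sketch below computes this directly with NumPy on made-up data, as an illustration of the quantity rather than a replacement for the statsmodels function; `vif_by_regression` is a hypothetical helper.

```python
import numpy as np


def vif_by_regression(X):
    """Compute VIF_j = 1 / (1 - R_j^2) for every column of a 2-D array."""
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])  # add intercept
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residual = y - design @ coef
        r_squared = 1 - residual.var() / y.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)


# Made-up data: x3 is almost a copy of x1, x2 is only weakly related to both
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
x3 = x1 + np.array([0.1, -0.2, 0.1, 0.2, -0.1, -0.1])
vifs = vif_by_regression(np.column_stack([x1, x2, x3]))
print(vifs.round(2))
```

The near-duplicate pair `x1` and `x3` receives very large values, while `x2` stays close to 1, which is the pattern that motivates dropping redundant columns.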

In [None]:
# features to consider removing
considered_features = ['monthly charges', 'no internet service (internet service)', 'no online security', 'no internet service (online security)', 'no internet service (online backup)', 'no internet service (device protection)', 'no tech support', 'no internet service (tech support)']

# compute vif 
compute_vif(considered_features).sort_values('VIF', ascending=False)

In [None]:
# create dataset with dropped values

list2 = ['no internet service (tech support)','no internet service (online backup)', 'no internet service (online security)', 'no internet service (internet service)', 'no internet service (device protection)']

for i in list2:
  if i in considered_features:
    considered_features.remove(i)

print(considered_features) 

In [None]:
# Confirm values of VIF
compute_vif(considered_features).sort_values('VIF', ascending=False)

In [None]:
# Columns to drop in the creation of the model
full_data_fs = full_data.drop(['no internet service (tech support)','no internet service (online backup)','no internet service (online security)','no internet service (internet service)','no internet service (device protection)','has no partner','no dependents'],axis=1)

full_data_fs.info()

In [None]:
# Business decision to drop columns
to_remove = ['has partner', 'no phone service (multiple lines)',
    'has multiple lines', 'fiber optic internet service',
    'no internet service (internet service)', 'has online security',
    'no internet service (online security)', 'has online backup',
    'no internet service (online backup)',
    'no internet service (device protection)', 'has device protection',
    'no internet service (tech support)', 'has tech support',
    'one-year contract', 'two-year contract', 'no dependents'
]

# Drop whichever of these columns are still present
# (some were already removed in the previous step)
full_data_fs = full_data_fs.drop(columns=to_remove, errors='ignore')

In [None]:
# Confirm column drop
full_data_fs.info()

An initial split into training and testing datasets was created to train and evaluate the models.

In [None]:
x_data = full_data_fs.drop('Churn',axis=1)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.30, random_state=53)

The different models that were tested in this analysis were:
- Random Forest
- Balanced Random Forest
- Logistic Regression
- Extreme Gradient Boosting
- K-Nearest Neighbors Algorithm
- Gaussian Naive Bayes
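
Since every model follows the same fit/predict pattern, the comparison can be written as a loop over a dictionary of estimators. The sketch below is a minimal illustration on synthetic data from `make_classification`, not on the customer dataset, and it only includes the scikit-learn models so that it has no other dependencies. `XGBClassifier` and `BalancedRandomForestClassifier` could be added to the dictionary in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data, for illustration only
X, y = make_classification(n_samples=300, n_features=8, random_state=53)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=53)

models = {
    "Random Forest": RandomForestClassifier(random_state=53),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Gaussian Naive Bayes": GaussianNB(),
}

# Fit each model on the same split and record its F1-score on the test part
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = f1_score(y_te, model.predict(X_te))
    print(f"{name}: F1 = {scores[name]:.2f}")
```

Keeping the scores in a dictionary makes it straightforward to rank the models or compare two iterations side by side.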


In [None]:
# Different models were run
X_train, X_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.30, random_state=53)

print('RF') # Random Forest

rf = RandomForestClassifier(random_state = 53)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred))

print('Balanced RF') # Balanced Random Forest

brf = BalancedRandomForestClassifier(random_state = 53)
brf.fit(X_train, y_train)
y_pred = brf.predict(X_test)
print(classification_report(y_test, y_pred))

print('LOGREG') # Logistic Regression

logreg = LogisticRegression(random_state = 53)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print(classification_report(y_test, y_pred))

print('XGB') # Extreme Gradient Boosting

xgb = XGBClassifier()
xgb.fit(X_train, y_train)
y_pred = xgb.predict(X_test)
print(classification_report(y_test, y_pred))

print('KNN') # K-Nearest Neighbors Algorithm 

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(classification_report(y_test, y_pred))

print('Gaussian') # Gaussian Naive Bayes

gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
print(classification_report(y_test, y_pred))

In [None]:
# Feature Importance is calculated
# Weight how much of a predictive power each variable has
importances = list(rf.feature_importances_)
col_labels = x_data.columns.values.tolist()

dict_test = {"label":col_labels,"importances":importances}

df_test = pd.DataFrame(dict_test, columns=['label','importances'])

df_test.sort_values(by=['importances'])

In [None]:
# Create a heatmap of the correlation between the remaining features,
# reusing get_heatmap as defined earlier in the notebook
get_heatmap(x_data)

In [None]:
# Combine the payment methods into a single flag: non-automatic payment (yes = 1, no = 0)
x_data["non-automatic payment"] = np.where((x_data['electronic check'] == 1) 
                                             | (x_data['mailed check'] == 1),
                                             1, 0)

# Drop the original payment-method columns
x_data.drop(['electronic check', 'bank transfer (automatic)',
             'credit card (automatic)', 'mailed check'],
            axis=1, inplace=True)

In [None]:
# Confirm changes done in the dataset
x_data.info()

In [None]:
# Create a combined variable of Tenure and Monthly charges, named Total Charges
x_data['total charges'] = x_data['tenure'] * x_data['monthly charges']
x_data.drop('tenure',axis=1,inplace=True) # drop of original column
x_data.drop('monthly charges',axis=1,inplace=True) # drop of original column

Another training and testing dataset was created to retest the model.

In [None]:
# Use the Synthetic Minority Over-sampling Technique (SMOTE) to balance the classes
# of the imbalanced dataset, then split the resampled data into training and test sets

sm = SMOTE(random_state=53)
X_res, y_res = sm.fit_resample(x_data, y_data)
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size=0.25, random_state=53)

print('RF') # Random Forest

rf = RandomForestClassifier(random_state = 53)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred))

print('BRF') # Balanced Random Forest

brf = BalancedRandomForestClassifier(random_state = 53)
brf.fit(X_train, y_train)
y_pred = brf.predict(X_test)
print(classification_report(y_test, y_pred))

print('LOGREG') # Logistic Regression

logreg = LogisticRegression(random_state = 53)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print(classification_report(y_test, y_pred))

print('XGB') # Extreme Gradient Boosting

xgb = XGBClassifier()
xgb.fit(X_train, y_train)
y_pred = xgb.predict(X_test)
print(classification_report(y_test, y_pred))

print('KNN') # K-Nearest Neighbors Algorithm 

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(classification_report(y_test, y_pred))

print('Gaussian') # Gaussian Naive Bayes

gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
print(classification_report(y_test, y_pred))

In [None]:
# Final check of feature importance.

importances = list(rf.feature_importances_)
col_labels = x_data.columns.values.tolist()

dict_test = {"label":col_labels,"importances":importances}

df_test = pd.DataFrame(dict_test, columns=['label','importances'])

df_test['importances'] = df_test['importances'] * 100
 
df_test.sort_values(by=['importances'],ascending=False)