# **Women In AI Analytics Project**
#### An End-to-End Analytics Learning Workflow for Clustering Analysis

***Contributors to this Notebook:*** Frances Oparaocha

***Supervised by:*** Sam Ayo

## Results
### Assumptions of Clustering Analysis
 - Assumptions of k-means clustering
      - The variables have the same variance
      - The variables have the same average value
      - The distribution of each variable is symmetrical (not skewed)
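The assumptions listed above can be checked numerically before clustering. The following is a minimal sketch, assuming the features of interest are numeric columns in a pandas DataFrame; the helper names `kmeans_assumption_report` and `standardise` and the toy values are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def kmeans_assumption_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarise mean, variance and skewness for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    return pd.DataFrame({
        "mean": numeric.mean(),
        "variance": numeric.var(),
        "skewness": numeric.skew(),
    })


def standardise(frame: pd.DataFrame) -> pd.DataFrame:
    """Rescale numeric columns to zero mean and unit variance (z-scores)."""
    numeric = frame.select_dtypes(include="number")
    return (numeric - numeric.mean()) / numeric.std()


# Hypothetical toy data used only to illustrate the check
toy = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 2, 72],
    "monthlyCharges": [29.85, 56.95, 53.85, 42.30, 70.70, 21.15],
})

report = kmeans_assumption_report(toy)                 # raw scale
scaled_report = kmeans_assumption_report(standardise(toy))  # after z-scoring
print(report)
print(scaled_report)
```

Standardising equalises the means and variances but leaves the skewness unchanged, so a strongly skewed variable may still need a separate transformation (a log transform is one common option).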
 
### Expected Results
1. A predictive model that predicts customer behaviour.
2. A documented API for interacting with this model that returns predictions based on input.

***This notebook solution is divided into 4 Sections, each constituting a workflow on its own:***

- Part I: Initial Data Analysis and Preprocessing
- Part II: EDA
- Part III: Feature Engineering
- Part IV: Predictions

#### <strong>Cumulative WORK FLOW</strong>
1. Import libraries and data
2. Data inspection
3. EDA/Preprocessing - univariate graphical Exploration and visualization
    - Data cleaning
    - Data manipulation
    - Feature engineering
    - Hypothesis testing
4. Modelling and prediction using RNN, K-means, LGBM, XGBoost, Logistic Regression and 4 others
5. Model evaluation

#### Setup

##### <strong>Library import</strong>
Import all the required Python libraries.

It is a good practice to organize the imported libraries by functionality, as shown below.

In [62]:
#pip install shap
!pip install mlflow



In [1]:
# Libraries for data analysis
import pandas as pd
import numpy as np

# Libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

#Libraries for modelling

#### Reading in the datasets

In [2]:
charges = pd.read_csv('charges.csv')
charges.head()

| | customerID | tenure | contract | paperlessBilling | paymentMethod | monthlyCharges | totalCharges | churn |
|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | 1 | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | 34 | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | 2 | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | 45 | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | 2 | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


In [16]:
personal = pd.read_csv('personal.csv')
personal.head()

| | customerID | gender | partner | dependents | age |
|---|---|---|---|---|---|
| 0 | 5575-GNVDE | Male | No | No | 41 |
| 1 | 3668-QPYBK | Male | No | No | 58 |
| 2 | 7795-CFOCW | Male | No | No | 61 |
| 3 | 9237-HQITU | Female | No | No | 66 |
| 4 | 9305-CDSKC | Female | No | No | 87 |


In [19]:
plan = pd.read_csv('plan.csv')
plan.head()

| | customerID | phoneService | multipleLines | internetService | onlineSecurity | onlineBackup | deviceProtection | techSupport | streamingTV | streamingMovies |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5575-GNVDE | Yes | No | DSL | Yes | No | Yes | No | No | No |
| 1 | 7795-CFOCW | No | No phone service | DSL | Yes | No | Yes | Yes | No | No |
| 2 | 9237-HQITU | Yes | No | Fiber optic | No | No | No | No | No | No |
| 3 | 1452-KIOVK | Yes | Yes | Fiber optic | No | Yes | No | No | Yes | No |
| 4 | 6713-OKOMC | No | No phone service | DSL | Yes | No | No | No | No | No |


#### Data Inspection

In [27]:
# Checking the shape of the datasets
print(plan.shape)
print(personal.shape)
print(charges.shape)

(3540, 10)
(5283, 5)
(7032, 8)
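The three files have different row counts (3,540, 5,283 and 7,032), so not every customer appears in every file. One way to quantify this before merging is to compare the key sets. This is a minimal sketch, assuming `customerID` is the shared key; the helper `key_overlap` and the miniature frames are hypothetical stand-ins for the real data.

```python
import pandas as pd


def key_overlap(left: pd.DataFrame, right: pd.DataFrame, key: str) -> dict:
    """Count keys shared by both frames and keys unique to each."""
    left_keys, right_keys = set(left[key]), set(right[key])
    return {
        "both": len(left_keys & right_keys),
        "left_only": len(left_keys - right_keys),
        "right_only": len(right_keys - left_keys),
    }


# Hypothetical miniature frames standing in for charges and plan
charges_demo = pd.DataFrame({"customerID": ["A", "B", "C", "D"]})
plan_demo = pd.DataFrame({"customerID": ["B", "C", "E"]})

overlap = key_overlap(charges_demo, plan_demo, "customerID")
print(overlap)
```

On the real data the call would look like `key_overlap(charges, plan, "customerID")`; the `both` count is the number of rows an inner join on that key can keep, provided the key is unique in each frame.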


In [29]:
# Checking for duplicates
print(charges.duplicated().sum())
print(plan.duplicated().sum())
print(personal.duplicated().sum())

0
0
0
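`duplicated()` with no arguments only flags rows that are identical in every column. Because the datasets are later joined on `customerID`, it may also be worth checking that the key itself is unique. The sketch below uses a hypothetical helper `duplicate_keys` and made-up rows to show the difference.

```python
import pandas as pd


def duplicate_keys(frame: pd.DataFrame, key: str) -> pd.DataFrame:
    """Return every row whose key value occurs more than once."""
    return frame[frame.duplicated(subset=key, keep=False)]


# Hypothetical data: no fully identical rows, but customer "B" appears twice
demo = pd.DataFrame({
    "customerID": ["A", "B", "B", "C"],
    "tenure": [1, 34, 35, 2],
})

full_row_duplicates = int(demo.duplicated().sum())  # compares all columns
dupes = duplicate_keys(demo, "customerID")          # compares the key only
print(full_row_duplicates)
print(dupes)
```

A quicker yes/no check is `demo["customerID"].is_unique`.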


In [37]:
# Checking the info 
print(personal.info())
print(charges.info())
print(plan.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5283 entries, 0 to 5282
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   customerID  5283 non-null   object
 1   gender      5283 non-null   object
 2   partner     5283 non-null   object
 3   dependents  5283 non-null   object
 4   age         5283 non-null   int64 
dtypes: int64(1), object(4)
memory usage: 206.5+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7032 entries, 0 to 7031
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7032 non-null   object 
 1   tenure            7032 non-null   int64  
 2   contract          7032 non-null   object 
 3   paperlessBilling  7032 non-null   object 
 4   paymentMethod     7032 non-null   object 
 5   monthlyCharges    6577 non-null   float64
 6   totalCharges      6577 non-null   float64
 7   churn             7032 

In [11]:
# Checking for missing values
charges.isnull().sum()

customerID            0
tenure                0
contract              0
paperlessBilling      0
paymentMethod         0
monthlyCharges      455
totalCharges        455
churn                 0
dtype: int64
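Raw counts are easier to judge as a share of the rows: 455 missing values out of 7,032 rows is about 6.5% of `monthlyCharges` and of `totalCharges`. A minimal sketch of a summary that reports both figures follows; the helper name `missing_summary` and the demo values are illustrative assumptions.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns that actually contain missing values
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Hypothetical data with gaps in the two charge columns
demo = pd.DataFrame({
    "tenure": [1, 34, 2, 45],
    "monthlyCharges": [29.85, np.nan, 53.85, 42.30],
    "totalCharges": [29.85, np.nan, 108.15, np.nan],
})

summary = missing_summary(demo)
print(summary)
```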

In [40]:
# Checking summary statistics for the numeric columns of charges
charges.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| tenure | 7032.0 | 32.421786 | 24.54526 | 1.0 | 9.0 | 29.0 | 55.0 | 72.0 |
| monthlyCharges | 6577.0 | 64.654637 | 30.101974 | 18.25 | 35.25 | 70.3 | 89.85 | 118.75 |
| totalCharges | 6577.0 | 2274.584719 | 2263.042489 | 18.8 | 399.45 | 1389.85 | 3775.85 | 8684.8 |


In [41]:
# Checking summary statistics for the numeric columns of personal
personal.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| age | 5283.0 | 55.017414 | 20.040712 | 20.0 | 38.0 | 55.0 | 72.0 | 90.0 |


In [42]:
# Checking summary statistics for the categorical columns of plan
plan.describe().T

| | count | unique | top | freq |
|---|---|---|---|---|
| customerID | 3540 | 3540 | 5575-GNVDE | 1 |
| phoneService | 3540 | 2 | Yes | 3207 |
| multipleLines | 3540 | 3 | No | 1692 |
| internetService | 3540 | 3 | Fiber optic | 1577 |
| onlineSecurity | 3540 | 3 | No | 1745 |
| onlineBackup | 3540 | 3 | No | 1529 |
| deviceProtection | 3540 | 3 | No | 1528 |
| techSupport | 3540 | 3 | No | 1759 |
| streamingTV | 3540 | 3 | No | 1430 |
| streamingMovies | 3540 | 3 | Yes | 1388 |
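For categorical columns, `describe()` reports only the most frequent value and its count. To see how all categories are distributed, relative frequencies can be computed per column. This is a small sketch with a hypothetical helper `category_shares` and made-up rows.

```python
import pandas as pd


def category_shares(frame: pd.DataFrame, column: str) -> pd.Series:
    """Relative frequency of each category in a column, as percentages."""
    return (frame[column].value_counts(normalize=True) * 100).round(1)


# Hypothetical rows standing in for the plan dataset
demo = pd.DataFrame({
    "internetService": ["DSL", "Fiber optic", "Fiber optic", "No", "Fiber optic"],
})

shares = category_shares(demo, "internetService")
print(shares)
```

On the real data this would be called as `category_shares(plan, "internetService")`.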


In [63]:
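# Merging all the datasets on customerID (an inner join keeps only customers present in all three)
df = charges.merge(personal, on='customerID', how='inner').merge(plan, on='customerID', how='inner')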
df

| | customerID | tenure | contract | paperlessBilling | paymentMethod | monthlyCharges | totalCharges | churn | gender | partner | ... | age | phoneService | multipleLines | internetService | onlineSecurity | onlineBackup | deviceProtection | techSupport | streamingTV | streamingMovies |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5575-GNVDE | 34 | One year | No | Mailed check | 56.95 | 1889.50 | No | Male | No | ... | 41 | Yes | No | DSL | Yes | No | Yes | No | No | No |
| 1 | 7795-CFOCW | 45 | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No | Male | No | ... | 61 | No | No phone service | DSL | Yes | No | Yes | Yes | No | No |
| 2 | 9237-HQITU | 2 | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes | Female | No | ... | 66 | Yes | No | Fiber optic | No | No | No | No | No | No |
| 3 | 1452-KIOVK | 22 | Month-to-month | Yes | Credit card (automatic) | 89.10 | 1949.40 | No | Male | No | ... | 39 | Yes | Yes | Fiber optic | No | Yes | No | No | Yes | No |
| 4 | 6713-OKOMC | 10 | Month-to-month | No | Mailed check | 29.75 | 301.90 | No | Female | No | ... | 39 | No | No phone service | DSL | Yes | No | No | No | No | No |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3532 | 6894-LFHLY | 1 | Month-to-month | Yes | Electronic check | 75.75 | 75.75 | Yes | Male | No | ... | 46 | Yes | Yes | Fiber optic | No | No | No | No | No | No |
| 3533 | 8456-QDAVC | 19 | Month-to-month | Yes | Bank transfer (automatic) | 78.70 | 1495.10 | No | Male | No | ... | 24 | Yes | No | Fiber optic | No | No | No | No | Yes | No |
| 3534 | 2569-WGERO | 72 | Two year | Yes | Bank transfer (automatic) | 21.15 | 1419.40 | No | Female | No | ... | 33 | Yes | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service |
| 3535 | 2234-XADUH | 72 | One year | Yes | Credit card (automatic) | NaN | NaN | No | Female | Yes | ... | 46 | Yes | Yes | Fiber optic | No | Yes | Yes | No | Yes | Yes |
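The inner join keeps far fewer customers than the largest input (7,032 rows in `charges`). To see where rows are lost, the merge can be audited with pandas' `indicator` and `validate` options. The sketch below is one possible approach on hypothetical miniature frames, not the notebook's own procedure.

```python
import pandas as pd

# Hypothetical miniature frames standing in for charges and personal
charges_demo = pd.DataFrame({"customerID": ["A", "B", "C"], "tenure": [1, 34, 2]})
personal_demo = pd.DataFrame({"customerID": ["B", "C", "D"], "age": [41, 58, 61]})

# An outer join with indicator=True adds a "_merge" column that records
# whether each key was found in the left frame, the right frame, or both.
# validate="one_to_one" raises MergeError if a key repeats on either side.
audit = charges_demo.merge(
    personal_demo,
    on="customerID",
    how="outer",
    indicator=True,
    validate="one_to_one",
)
print(audit["_merge"].value_counts())

# The inner join keeps only the keys marked "both" in the audit
merged = charges_demo.merge(
    personal_demo, on="customerID", how="inner", validate="one_to_one"
)
print(merged)
```

Naming the key with `on=` also guards against pandas silently joining on any other column name the frames happen to share.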


In [64]:
# Checking for missing values after the merge
df.isnull().sum()

customerID            0
tenure                0
contract              0
paperlessBilling      0
paymentMethod         0
monthlyCharges      234
totalCharges        234
churn                 0
gender                0
partner               0
dependents            0
age                   0
phoneService          0
multipleLines         0
internetService       0
onlineSecurity        0
onlineBackup          0
deviceProtection      0
techSupport           0
streamingTV           0
streamingMovies       0
dtype: int64

In [65]:
# Viewing the number of rows and columns in the dataset
df.shape

(3537, 21)

In [69]:
# Flagging which rows have a missing monthlyCharges value
df['monthlyCharges'].isnull()

0       False
1       False
2       False
3       False
4       False
        ...  
3532    False
3533    False
3534    False
3535     True
3536    False
Name: monthlyCharges, Length: 3537, dtype: bool
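The boolean Series above can be used as a mask to pull out the affected rows and to decide how to handle them. The sketch below shows two common options, dropping the rows or filling with the column median, on hypothetical data. Which option suits this project is an open choice that the notebook does not settle at this point.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing monthlyCharges value
demo = pd.DataFrame({
    "customerID": ["A", "B", "C", "D"],
    "monthlyCharges": [20.0, np.nan, 40.0, 90.0],
})

mask = demo["monthlyCharges"].isnull()
missing_rows = demo[mask]    # only the rows with a missing value
n_missing = int(mask.sum())  # True counts as 1, so the sum is the row count

# Option 1: drop rows where monthlyCharges is missing
dropped = demo.dropna(subset=["monthlyCharges"])

# Option 2: fill missing values with the column median (NaN is ignored)
median_value = demo["monthlyCharges"].median()
imputed = demo.assign(monthlyCharges=demo["monthlyCharges"].fillna(median_value))

print(missing_rows)
print(imputed)
```

Both options return new DataFrames and leave `demo` unchanged, which makes it easy to compare them side by side.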