# Customer Churn Prediction
## Project Overview
This project uses data from the telecom domain. Because of tough competition, customers tend to switch between telecommunication service providers. For example, an Airtel customer might move to Jio services, and vice versa. This customer behaviour is known as churn.

## Objective
Predict whether a customer will churn, then take the Next Best Action to prevent that churn.

## Stages to be covered during the solution
- `Data Merging and Wrangling:` Combining multiple data sources and cleaning the data
- `Exploratory Data Analysis:` Understanding the relationship between features and with target
- `Data Preprocessing:` Data Encoding, Missing Value Treatment, Outlier Treatment, Feature Scaling
- `Model Building:` Train ML Model using the pre-processed data
- `Evaluation:` Assess the Model's performance

By the end of this project, you will have a complete workflow for predicting churn and/or creating classification models.


## Domain Background (Telecom Churn Story)

I’m working as a data analyst at a telecommunications company, which I refer to as TeleComCo. TeleComCo provides phone and internet services to a wide range of customers, and, as with most telecoms, customer churn (when customers stop using our service) is a key concern. High churn rates lead to lost revenue and may signal customer dissatisfaction.

In the last quarter, we noticed a rise in customers leaving for competitors. In response, our management tasked my team with investigating the reasons behind this churn and developing a model to predict which customers are most at risk of leaving. The purpose is to identify these customers in advance and proactively offer them incentives to stay.

For this project, I’ve been given a dataset detailing account information for both past and current customers, along with data indicating whether or not each customer eventually churned. My primary responsibilities include:
- Analyzing this dataset to detect patterns and factors associated with customer churn.
- Building a predictive model (specifically using logistic regression) to estimate churn risk for individual customers.

Through this data exploration, I expect to identify patterns such as:
- Customers with longer tenures are generally less likely to churn, while newer customers may be at greater risk.
- Those with certain types of plans or higher monthly charges might be more inclined to leave, possibly due to the cost factor.
- Demographic details could influence churn; for instance, senior citizens may use our services differently or have specific needs.
- Customer preferences, like opting for paperless billing or bundling phone and internet, might also relate to their likelihood of churning.

By thoroughly investigating these factors and building the predictive model, we aim to help TeleComCo understand why customers leave and reduce future churn through timely interventions.

## Dataset Description

The dataset consists of customer records, each with a variety of features describing the customer and their service usage. Below is an overview of each column in the data:
- `customer_id:` A unique identifier for each customer (e.g., a UUID). This is just an ID and not useful for prediction.
- `customer_email:` The email address of the customer. This is an identifier as well and not directly useful for the model.
- `age:` The age of the customer (in years). This could be related to churn if different age groups have different service preferences.
- `senior_citizen:` Whether the customer is a senior citizen or not (boolean: true/false). Typically, this might be derived from age (e.g., age > 65).
- `partner:` Whether the customer has a partner or not (boolean). This indicates if the customer is married or in a long-term partnership. In telecom, having a partner might mean family plans or shared services.
- `dependents:` Whether the customer has dependents (children or other dependents) or not (boolean). Customers with dependents might have different usage (e.g., family plans).
- `tenure_months:` The number of months the customer has been with the company. Higher tenure might indicate loyalty; low tenure customers are newer and might be more likely to churn if they haven't established loyalty.
- `phone_service:` Whether the customer has phone service with the company (boolean). Some customers might only have internet service; this feature tells if they also subscribed to phone.
- `paperless_billing:` Whether the customer has opted for paperless billing (boolean). This could be a proxy for tech-savvy behavior or convenience preference.
- `monthly_charges:` The amount `$` charged to the customer every month. This is like their monthly bill. Customers with higher bills might churn due to cost, or those with very low bills might churn if they are not using many services.
- `total_charges:` The total amount `$` the customer has been charged since joining (this is roughly monthly_charges * tenure, plus any extras). This can indicate the overall value of the customer; low total charges might mean the customer is relatively new or has a low-cost plan.
- `churn:` The target variable: whether the customer has churned (true = yes, the customer left; false = no, the customer is still with the company). This is what we want to predict.
- `last_interaction_date:` The date of the last interaction with the customer (could be the last service use or last customer support call, etc.). This might give insight into how recently the customer was active. Customers with very old last interactions might have silently churned.
- `region:` The geographic region or state where the customer resides (e.g., Ohio, California, etc.). Different regions might have different market conditions or competitor presence, possibly affecting churn.
- `signup_date:` The date when the customer originally signed up for service. (Note: This column is present in one of the source files. When we merge data, some records might not have a signup_date if it wasn't recorded for them.)



## Potential Questions and Considerations:

Based on the above features, here are some questions that might arise and that we will explore in this project:
- Do older customers or senior citizens tend to churn more or less than younger customers?
- Does having a partner or dependents influence churn? (For example, do single customers churn more often than those with family plans?)
- How does tenure relate to churn? Are newer customers more likely to leave compared to long-term customers?
- What about monthly charges? Are customers with high monthly charges more likely to churn (perhaps due to higher cost), or could it be that those with low charges churn because they might not be fully utilizing the service?
- Are there any regional trends in churn? (We might check if certain regions have higher churn rates.)
- How do features like phone service or paperless billing correlate with churn? (e.g., maybe paperless billing users are more engaged or maybe less personal interaction leads to higher churn?)
- Are there outliers or unusual values in charges or tenure that need special attention?
  
`Try to answer these questions step-by-step in the analysis below.`

### Task 0: Import required libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
# tell Jupyter to render matplotlib plots in the output section of the cell
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
from sklearn.preprocessing import StandardScaler
# churn is a classification problem, so import the classifier and classification metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

### Task 1: Combine the two datasets
The customer data is provided in two CSV files (Customer_Churn_data.csv and Customer_Churn_data_2.csv). Load both files and combine them into a single pandas DataFrame for analysis. The two files have the same columns (one file may have an extra column, signup_date). Ensure that after merging, all columns are aligned correctly.

In [2]:
file1 = pd.read_csv("Customer_Churn_data.csv")
print(file1.shape)
file1.head(2)


In [None]:
file2 = pd.read_csv("Customer_Churn_data_2.csv")
print(file2.shape)
file2.head(2)

### Task 2: View the first few rows of the combined data

After merging, use the DataFrame's head() method to display the first 5 rows of the combined dataset. This will help verify that the data from both files has been concatenated correctly and that columns are as expected.

In [None]:
data = pd.concat([file1, file2], ignore_index=True)  # stack the rows; columns are aligned by name
data.head()  # first 5 rows of the combined data
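
To make the column alignment concrete, here is a minimal sketch on two toy frames. The frame names `part1` and `part2` and all values are hypothetical stand-ins for the two source files, where only one of them carries `signup_date`.

```python
import pandas as pd

# Toy stand-ins for the two source files (hypothetical values).
part1 = pd.DataFrame({"customer_id": ["a1", "a2"], "age": [34, 51]})
part2 = pd.DataFrame({
    "customer_id": ["b1"],
    "age": [27],
    "signup_date": ["2021-03-14"],
})

# Columns are aligned by name; a column missing from one frame becomes NaN there.
combined = pd.concat([part1, part2], ignore_index=True)
print(combined)

# Rows that came from the frame without signup_date show up as missing.
print(combined["signup_date"].isna().sum())
```

With `ignore_index=True` the result gets a fresh 0..n-1 row index, which avoids duplicate index labels from the two inputs.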

### Task 3: Understand the dataset dimensions and dtypes
Determine the size of the combined dataset: find out how many rows and columns are present (the .shape attribute gives this directly). Then use the DataFrame's .info() method, which shows the data type of each column and whether there are any missing values (non-null counts) in each column. Verify that numeric columns are correctly recognized (e.g., age and tenure_months should be int or float, charges should be float, and churn and the other booleans might appear as bool).

In [None]:
data.info()

`Verify if the data has been loaded perfectly`

In [None]:
# last_interaction_date --> convert to datetime
data['last_interaction_date'] = pd.to_datetime(data['last_interaction_date'])
data.info()

In [None]:
# signup_date --> convert into correct datatype (datetime)
data['signup_date'] = pd.to_datetime(data['signup_date']) # on pandas >= 2.0, pass format='mixed' to handle mixed formats in a single column
data.info()

### Task 4: Generate summary statistics
Use the .describe() method on the DataFrame to get summary statistics for the numeric columns (count, mean, std, min, quartiles, max). This will give an overview of the
distributions (e.g., average age, average tenure, min/max charges, etc.).

In [None]:
data.describe(include=object)

In [None]:
data.describe(include=int)

In [None]:
data.describe(include=float)
# tenure*monthly charges = total_charges

In [None]:
data.describe(include=bool)

In [None]:
data.describe(include=['datetime'])  # the converted date columns are timezone-naive, so 'datetimetz' would not match them

### Task 5: Check for duplicate entries
Ensure there are no duplicate customer records in the data. For instance, verify if customer_id is unique across the combined dataset. You can use pandas functions like
.duplicated() on the customer_id column to check for any duplicates.

In [None]:
data['customer_id'].duplicated().sum() #number of duplicated rows in the given col/df

In [None]:
data.duplicated().sum() # number of duplicated rows in the given df

In [None]:
data.sample(3)

### Task 6: Identify missing values
Identify if there are any missing values in the dataset and in which columns. Use methods like .isnull().sum() to get the count of null or NaN values per column. This will highlight columns that need attention (e.g., we expect many missing in signup_date if one file lacked it).

In [None]:
data.isnull().sum()

In [None]:
sns.heatmap(data.isnull())

### Task 7: Analyze the pattern of missing data
- Examine the missing data pattern and determine the likely mechanism:
  - MCAR (Missing Completely at Random): there is no identifiable pattern; the gaps are purely random.
  - MAR (Missing At Random): the missingness is related to some other observed data.
  - MNAR (Missing Not at Random): the missingness is related to the unobserved value itself, or the value is systematically absent for a particular subset.

- Question: Based on the columns with missing data, what type of missingness do you suspect? For example, if signup_date is missing for all customers from one file, that's a systematic pattern (likely MNAR or a data collection issue). Document your reasoning.
- Answer: signup_date is not missing at random; it is missing by design, because one of the source files never recorded it.
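
One way to back up that reasoning is to compare the share of missing values per source file. The sketch below uses toy frames and a hypothetical `source` column added before concatenation; in the notebook the inputs would be `file1` and `file2`.

```python
import pandas as pd

# Toy frames (hypothetical values); the first one has no signup_date column at all.
f1 = pd.DataFrame({"age": [30, 45, 52]})
f2 = pd.DataFrame({"age": [28, 61], "signup_date": ["2020-01-05", "2019-07-21"]})

# Tag each row with its origin before concatenating ("source" is an assumed helper column).
combined = pd.concat(
    [f1.assign(source="file1"), f2.assign(source="file2")],
    ignore_index=True,
)

# Share of missing signup_date values per source file.
missing_share = combined["signup_date"].isna().groupby(combined["source"]).mean()
print(missing_share)
```

A share of 1.0 for one file and 0.0 for the other would indicate that the gap is tied to the data source rather than scattered at random.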

### Task 8: Handle missing values
- Decide on a strategy to handle the missing data identified above. For instance:
  - If a column has too many missing values (or is not crucial), you might choose to drop that column.
  - If only a few records have missing values, you might choose to fill (impute) them with an appropriate value (mean, median, mode, or a special indicator).
- Apply the chosen strategy. For example, if signup_date is missing for a large portion and not critical to the analysis, you might drop the signup_date column to simplify the dataset.
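
The two strategies above can be sketched on a toy frame as follows. The 0.5 threshold and the choice of the median are illustrative assumptions, not values prescribed by this project.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): one sparse column and one column with a single gap.
df = pd.DataFrame({
    "monthly_charges": [50.0, np.nan, 70.0, 90.0],
    "signup_date": [np.nan, np.nan, np.nan, "2021-01-01"],
})

# Drop columns whose share of missing values exceeds a chosen threshold (0.5 is an assumption).
threshold = 0.5
missing_share = df.isna().mean()
cleaned = df.drop(columns=missing_share[missing_share > threshold].index)

# Impute the remaining numeric gap with the column median.
median_charge = cleaned["monthly_charges"].median()
cleaned["monthly_charges"] = cleaned["monthly_charges"].fillna(median_charge)
print(cleaned)
```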

### Task 9: Drop unneeded columns
- Some columns are not useful for predicting churn and can be removed to simplify the analysis. Typically, identifier columns like customer_id and customer_email do not have predictive value. Also, if we have decided not to use certain columns (like dates or any others) for modeling, we can drop them as well to avoid clutter.
- Remove the following:
  - customer_id and customer_email (identifiers)
  - last_interaction_date (a date field that we will not use in the model for now, to keep things simple)
  - signup_date (if you did not drop it already in the missing data step)
- Note: the cells below keep customer_id as the row index instead of dropping it, and keep last_interaction_date until Task 25, where date features are extracted from it.

In [None]:
data.info()

In [None]:
data.set_index('customer_id', inplace=True) # we can choose to make cols with all unique values as row index for better reference

In [None]:
data.sample(4)

In [None]:
# email is PII (Personally Identifiable Information), so it should never be used as a machine learning feature
# there is no sound way to fill the missing signup_date values, so we drop the column
print(data.shape)
data.drop(['customer_email', 'signup_date'], axis=1, inplace=True)

In [None]:
print(data.shape)
data.sample(3)

In [None]:
# RFM modeling --> Recency (how recent), Frequency (how frequent), Monetary (how much they pay)
# in this data: last_interaction_date maps to Recency, tenure_months stands in for Frequency, and total_charges is Monetary
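
As a rough illustration of the Recency idea, a date column can be turned into "days since last interaction". This is a sketch on toy dates; the reference date (here, the latest interaction in the data) and the `recency_days` column name are assumptions.

```python
import pandas as pd

# Toy dates (hypothetical values).
df = pd.DataFrame({
    "last_interaction_date": pd.to_datetime(["2024-01-01", "2024-03-15", "2024-03-31"]),
})

# The reference date is an assumption; here it is the latest interaction in the data.
reference_date = df["last_interaction_date"].max()

# Subtracting datetimes gives timedeltas; .dt.days extracts whole days.
df["recency_days"] = (reference_date - df["last_interaction_date"]).dt.days
print(df)
```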

## Exploratory Data Analysis (EDA)
Now that the data is clean and prepared, let's perform some exploratory analysis to understand the data better and to gather insights about what factors might affect churn.
We will look at the distribution of variables and relationships between features and the churn outcome.

### Task 10: Examine the distribution of the target variable (Churn)

Let's see how many customers in our dataset churned vs. stayed. Plot a count of churned vs non-churned customers. This can be done using a bar plot (or simply checking the value counts). This will tell us the balance of our classes (churn vs no churn).

In [None]:
data.churn.value_counts().plot(kind='bar') # We have a balanced dataset with roughly 50:50 rows for the two classes

### Task 11: Distribution of customer ages
Plot a histogram of the age of customers. This will show the distribution of customer ages. Are most customers in a certain age range? This might help identify if our customer
base is younger or older on average.

In [None]:
data.age.hist(bins=20) # The age distribution seems to be uniformly distributed
plt.show()

In [None]:
data.boxplot('age')

In [None]:
data.age.describe()

### Task 12: Distribution of customer tenure
Plot a histogram of the tenure_months to see how long customers tend to stay with the company. Is there a large number of new customers (low tenure) in the data? Do we see many customers at the maximum tenure (72 months, if that's the max)? Understanding tenure distribution will help in analyzing churn by tenure later.

In [None]:
sns.histplot(data.tenure_months)
plt.show()

In [None]:
data.boxplot('tenure_months')
plt.show()

In [None]:
data.tenure_months.describe()

In [None]:
data.tenure_months.value_counts()

- `Analysis: customers are spread over the full tenure range, so the distribution is roughly uniform.`
- `In case of an uneven distribution, we can transform the variable into categories like new, <3M, <6M, <1Y, <2Y, ...`
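
The binning idea mentioned above can be sketched with `pd.cut`. The bin edges and labels below are illustrative assumptions chosen to mirror the categories in the note.

```python
import pandas as pd

# Toy tenure values in months (hypothetical).
tenure = pd.Series([0, 2, 5, 11, 23, 60], name="tenure_months")

# Bin edges and labels are assumptions; right=False makes each bin [lower, upper).
bins = [0, 3, 6, 12, 24, float("inf")]
labels = ["<3M", "<6M", "<1Y", "<2Y", "2Y+"]
tenure_band = pd.cut(tenure, bins=bins, labels=labels, right=False)

print(tenure_band.value_counts(sort=False))
```

The result is an ordered categorical, so it can be used directly in a `groupby` to compare churn rates per tenure band.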

### Task 13: Distribution of monthly charges
Plot a histogram of the monthly_charges. This shows the distribution of monthly billing amounts. We can see the range of charges and if it's skewed (e.g., many customers at lower tiers vs higher tiers). Sometimes, very high or very low charges could influence churn.

In [None]:
sns.histplot(data.monthly_charges, bins=30)
plt.show()

`Analysis: comparatively, high-paying customers are fewer in number`

In [None]:
data.monthly_charges.plot(kind='kde')

In [None]:
sns.boxplot(data.monthly_charges)
plt.show()

In [None]:
data.monthly_charges.describe()
# for floating variables value counts is not of much use for analysis

### Task 14: Churn rate by senior citizen status
- Question: Are senior citizens more likely to churn compared to non-senior customers? Calculate the churn rate for senior citizens vs non-senior citizens. Churn rate can be defined as the percentage of customers in that group who have churned.
- You can do this by grouping the data by senior_citizen and calculating the mean of the churn column (if churn is encoded as 0/1, the mean gives the proportion that churned).
Alternatively, use value_counts of churn within each group.

In [None]:
data.groupby('senior_citizen')['churn'].mean()

### Task 15: Churn rate by partner status
Question: Does having a partner influence churn? Compute the churn rate for customers with a partner vs without a partner. Similar to above, group by partner and find the proportion that churned in each group.

In [None]:
data.groupby('partner')['churn'].mean().plot(kind='bar')

### Task 16: Average tenure of churned vs non-churned customers
Question: Do customers who churn tend to have shorter tenures? Calculate the average tenure (in months) for churned customers vs customers who stayed. This can be done
by grouping by churn status.

In [None]:
data.groupby('churn')['tenure_months'].mean().plot(kind='bar') # a bar chart compares the two averages directly; pie slices would read as shares of a whole

### Task 17: Average monthly charges of churned vs non-churned customers
Question: Do customers who churn pay more per month? Find the average monthly_charges for churned vs non-churned groups.

In [None]:
data.groupby('churn')['monthly_charges'].mean()

### Task 18: Average total charges of churned vs non-churned customers
Question: How do the total charges differ between churned and retained customers? Calculate the average total_charges for churned vs non-churned customers. (Since
total_charges is a function of monthly charges and tenure, this will reflect both how long and how much a churned customer contributed versus a stayed customer.)

In [None]:
data.groupby('churn')['total_charges'].mean()

### Task 19: Correlation analysis
Calculate the correlation matrix for the numeric features (and the churn indicator, encoded as 0/1). This will show how strongly features are linearly related to each other and to churn. In particular, look at correlations involving churn. Are any features strongly positively or negatively correlated with churn? Also note if any pair of features are highly correlated with each other (for instance, tenure and total_charges might be strongly correlated since longer tenure usually means more total charges).

In [None]:
corr=data.corr(numeric_only=True)
corr

In [None]:
plt.figure(figsize=(15,8))
sns.heatmap(corr, annot=True, cmap='Greens', center=0, fmt=".3f")
# Analysis: Since none of the features is highly correlated with any other feature, we do not need to drop any features
# If X features are highly correlated with each other, keep 1 of them and drop X-1, e.g. if 7 features are highly correlated, keep ANY 1 and drop the rest.
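
The "keep one, drop the rest" rule can be sketched as follows on synthetic data. The 0.9 threshold is an assumption, and `total_charges` is deliberately constructed here to track tenure closely; this is not a statement about the project's dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: total_charges is built to follow tenure_months almost exactly.
rng = np.random.default_rng(0)
tenure = rng.integers(1, 72, size=200)
df = pd.DataFrame({
    "tenure_months": tenure,
    "total_charges": tenure * 50.0 + rng.normal(0, 5, size=200),
    "age": rng.integers(18, 80, size=200),
})

corr = df.corr().abs()

# Keep only the upper triangle so each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flag a column if it is highly correlated with any earlier column (0.9 is an assumed cutoff).
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop)
```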

## Outlier Detection and Treatment
- Outliers are extreme values that deviate significantly from the rest of the data. They can affect our model, especially logistic regression, which can be influenced by very large or very small values. We will detect outliers in key numeric columns and decide on how to handle them.

- We'll use two common methods:
  - Interquartile Range (IQR) method: We consider points as outliers if they fall below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR for a given feature.
  - Z-score method: We calculate the z-score (standard score) for each data point in a feature. A common rule is to treat points with |z| > 3 as potential outliers (3 standard deviations away from the mean).

- We will apply these methods to the monthly_charges (as an example numeric feature, since charges could have outliers).

### Task 20: Detect outliers in monthly_charges using the IQR method
Calculate the first quartile (Q1) and third quartile (Q3) of monthly_charges, then compute the IQR (Q3 - Q1). Determine the IQR bounds:
- Lower bound = Q1 - 1.5 * IQR
- Upper bound = Q3 + 1.5 * IQR

Find which data points in monthly_charges lie outside these bounds. How many outliers do you detect using this rule?

In [None]:
data.monthly_charges.describe()

In [None]:
Q1 = data.monthly_charges.quantile(0.25)
Q3 = data.monthly_charges.quantile(0.75)

IQR = Q3-Q1
print(f"Q1: {Q1:.2f}\nQ3: {Q3:.2f}\nIQR: {IQR:.2f}")  # format to two decimals inside the f-string

In [None]:
lower_bound = Q1-1.5*IQR
upper_bound = Q3+1.5*IQR
print(f"Lower Bound: {round(lower_bound,2)}\nUpper Bound: {round(upper_bound,2)}")

In [None]:
ul_outliers = data[data['monthly_charges'] > upper_bound].shape[0]
ll_outliers = data[data['monthly_charges']< lower_bound].shape[0]

In [None]:
print(ul_outliers, ll_outliers)

In [None]:
sns.boxplot(data.monthly_charges,color='Green')

### Task 21: Detect outliers in monthly_charges using the Z-score method (used when the IQR method flags more than 20% of the points as outliers)
Calculate the mean and standard deviation of monthly_charges. Determine the Z-score bounds:
- Lower bound = MEAN - 3 * STD_DEV
- Upper bound = MEAN + 3 * STD_DEV

Find which data points in monthly_charges lie outside these bounds. How many outliers do you detect using this rule?

In [None]:
mean = np.mean(data.monthly_charges)
std = np.std(data.monthly_charges)

LL = mean - 3 * std
UL = mean + 3 * std

ul_outliers = data[data['monthly_charges'] > UL].shape[0]
ll_outliers = data[data['monthly_charges']< LL].shape[0]

print(f"Mean: {round(mean,2)} | Std Dev: {round(std,2)} | UL: {round(UL,2)} | LL: {round(LL,2)} | ul_outliers: {ul_outliers} | ll_outliers: {ll_outliers}")

### Task 22: Run the same analysis for the other numeric (int and float) columns
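
A small helper makes it easy to repeat both checks for every numeric column. This is a sketch on toy data; the function name `count_outliers` is hypothetical.

```python
import pandas as pd

def count_outliers(series: pd.Series) -> dict:
    """Count outliers in a numeric Series with the IQR rule and the z-score rule."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    iqr_mask = (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)
    # Population standard deviation (ddof=0), matching np.std's default.
    z = (series - series.mean()) / series.std(ddof=0)
    return {"iqr": int(iqr_mask.sum()), "zscore": int((z.abs() > 3).sum())}

# Toy data (hypothetical values) with one extreme monthly charge.
df = pd.DataFrame({
    "age": [25, 30, 35, 40, 45, 50, 55, 60],
    "monthly_charges": [20.0, 22.0, 25.0, 27.0, 30.0, 31.0, 33.0, 500.0],
})

summary = {col: count_outliers(df[col]) for col in df.select_dtypes("number").columns}
print(summary)
```

In this tiny sample the IQR rule flags the extreme charge while the z-score rule does not, because the extreme value itself inflates the standard deviation; the two rules can disagree, so it is worth checking both.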

`Outlier Treatment`

Decide how to handle any outliers found in monthly_charges (and any other numeric columns if you checked them). Common strategies include:
- Removing the outlier rows entirely.
- Capping the outliers (e.g., set values above the upper bound to the upper bound, and below the lower bound to the lower bound).
- Keeping them if they are legitimate values and not overly influential.

For this analysis, if outliers exist and are very few, you might choose to remove those records for simplicity. Alternatively, if they are not extreme or numerous, you might leave them in but be aware of them.

Implement the chosen outlier treatment for monthly_charges. (If no outliers were detected by either method, you can state that no action is needed or just skip removal.)
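
Capping and removal can both be sketched in a few lines. The values below are toy data, and the bounds reuse the IQR rule from Task 20.

```python
import pandas as pd

# Toy charges (hypothetical values) with one extreme entry.
charges = pd.Series(
    [20.0, 22.0, 25.0, 27.0, 30.0, 31.0, 33.0, 500.0], name="monthly_charges"
)

q1, q3 = charges.quantile(0.25), charges.quantile(0.75)
iqr = q3 - q1
lower_bound, upper_bound = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Capping: values outside the bounds are pulled back to the nearest bound.
capped = charges.clip(lower=lower_bound, upper=upper_bound)

# Removal: keep only the rows that fall inside the bounds.
kept = charges[charges.between(lower_bound, upper_bound)]

print(capped.max(), len(kept))
```

Capping keeps the row count unchanged, which matters when the other columns of those rows are still informative; removal is simpler but discards them.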

## Data Preprocessing (Encoding, Splitting, Scaling)
Before we can feed the data into a logistic regression model, we need to prepare the features:
- Convert categorical and boolean features into numeric form (encoding).
- Split the data into training and test sets.
- Scale/normalize features if needed, so that no single feature dominates due to scale differences (this can help the model converge faster and improve performance).

### Task 23: Encode the target variable churn as numeric

The churn column is currently in boolean (true/false) form (or Yes/No). Convert it to a numeric binary format, e.g., 1 for "Yes/True" (customer churned) and 0 for "No/False" (customer stayed). This will be our target y for modeling.

In [None]:
data.churn = data.churn.astype(int)
data.churn.sample(3)

### Task 24: Convert other boolean columns to 0/1
Similarly, convert all other boolean columns (senior_citizen, partner, dependents, phone_service, paperless_billing) into 0/1 numeric values (if they are not already numeric). This ensures that all features are in numeric form for the model.

In [None]:
data.info()

In [None]:
bool_cols = ['senior_citizen', 'partner', 'dependents', 'phone_service', 'paperless_billing']
data[bool_cols] = data[bool_cols].astype(int)

In [None]:
data.info()

### Task 25: Extract Date Features like dayofweek, Month, Date, Year, Hour, Mins etc. and drop the timestamp col

In [None]:
data['day_of_week'] = data['last_interaction_date'].dt.weekday
data['year'] = data['last_interaction_date'].dt.year
data['month'] = data['last_interaction_date'].dt.month
data['date'] = data['last_interaction_date'].dt.day
data['hour'] = data['last_interaction_date'].dt.hour
data['minute'] = data['last_interaction_date'].dt.minute
data['second'] = data['last_interaction_date'].dt.second
data['week_of_year'] = data['last_interaction_date'].dt.isocalendar().week
data['quarter'] = data['last_interaction_date'].dt.quarter
data.drop('last_interaction_date', axis=1, inplace=True)

In [None]:
data.sample(3)

### Task 26: One-hot encode the region column
The region column is categorical with many possible values (states or regions). We need to convert it into a numeric form. Use one-hot encoding to create dummy variables for each unique region. For example, region "Ohio" becomes a binary column that is 1 for Ohio residents and 0 otherwise, and so on for each region.

You can use pandas get_dummies function to do this. Be careful to avoid the dummy variable trap (when one dummy column is redundant because it can be inferred from
others). You can set drop_first=True to drop one of the region dummy columns, or handle it manually.

In [None]:
region_dummies = pd.get_dummies(data['region'], drop_first=True, dtype='int')

data = pd.concat([data, region_dummies], axis=1)
data.drop('region', axis=1, inplace=True)

data.sample(3)

### Task 27: Separate features and target variable
Now that the data is preprocessed, split the DataFrame into features (X) and target (y). - y should be the churn column (the 0/1 labels we want to predict). - X should be all the remaining columns that will serve as inputs to the model.

Make sure that X does not include the target itself or any columns we decided to drop (like IDs, emails, etc., which we already removed).

In [None]:
X = data.drop(columns='churn')
y = data.churn

### Task 28: Split data into training and testing sets
Use sklearn.model_selection.train_test_split to split the dataset into training and test sets. Typically, we might use 70% of the data for training and 30% for testing (or 80/20, etc.). Set a random_state for reproducibility.

The result should be X_train, X_test, y_train, y_test.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42) #try stratify=y
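
The `stratify=y` hint in the comment above can be tried as follows. This sketch uses a synthetic, deliberately imbalanced target to make the effect visible; it is not the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic toy target: 80 customers who stayed and 20 who churned.
X = pd.DataFrame({"feature": np.arange(100)})
y = pd.Series([0] * 80 + [1] * 20, name="churn")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# With stratify=y the churn share is preserved in both splits.
print(y_train.mean(), y_test.mean())
```

On a 50:50 dataset the difference is small, but stratifying costs nothing and keeps the class ratio identical across train and test.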

### Task 29: Feature scaling (Standardization)
For logistic regression, it is often beneficial to scale the features so they are on comparable scales (although logistic regression can still work without scaling, scaling can improve convergence and performance, especially if regularization is used).

Use a scaler from sklearn.preprocessing, such as StandardScaler, to standardize the numeric features in X. (The code below uses MinMaxScaler instead, for the reason given in its comments.) Important: Fit the scaler on the training data only, then use it to transform both the training and testing feature data. This prevents information from the test set leaking into the training process.

In [None]:
pd.set_option('display.max_columns',70)

X_train.sample(5)

In [None]:
# StandardScaler --> zero mean and unit variance (most values fall roughly between -4 and +4)
# MinMaxScaler (normalization) --> rescales each feature to the 0 to 1 range
# About 50% of our columns are already binary (0/1), so we use MinMaxScaler to put every feature on the same 0 to 1 range
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)
# Scaling after the split: fit_transform on X_train, then only transform X_test, so no test-set information leaks into training.
# Scaling before the split would need a single fit_transform, but it lets test-set statistics influence the scaler.
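
For comparison, the StandardScaler variant described in the task text would look roughly like this. The arrays are toy values standing in for numeric features such as tenure and monthly charges.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy numeric features (hypothetical values): tenure in months, monthly charges.
train = np.array([[1.0, 20.0], [12.0, 50.0], [24.0, 80.0], [60.0, 110.0]])
test = np.array([[6.0, 65.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # mean and std are learned from the training rows only
test_scaled = scaler.transform(test)        # the test rows reuse those training statistics

# Each training column now has mean ~0 and standard deviation ~1.
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
```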

### Task 30: Prepare ML Model, make predictions and perform evaluation
- Logistic Regression
- RandomForest
- SVM/NaiveBayes

#### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)

In [None]:
ypred_lr = lr_model.predict(X_test)

In [None]:
from sklearn.metrics import classification_report

In [None]:
print(classification_report(y_test, ypred_lr))
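
Beyond the classification report, a confusion matrix and ROC AUC are common complements. The sketch below runs on synthetic data from `make_classification`, not on the churn dataset, and the variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data (not the notebook's dataset).
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, model.predict(X_te))

# ROC AUC uses the predicted probability of the positive class, not the hard labels.
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

print(cm)
print(round(auc, 3))
```

An AUC near 0.5 means the model ranks churners no better than chance, which is a useful sanity check alongside accuracy on a balanced dataset.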

#### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
rfc_model = RandomForestClassifier(n_estimators=500)
rfc_model.fit(X_train, y_train)
ypred_rfc = rfc_model.predict(X_test)
print(classification_report(y_test, ypred_rfc))
#               precision    recall  f1-score   support

#            0       0.50      0.51      0.50     20143
#            1       0.49      0.49      0.49     19857

#     accuracy                           0.50     40000
#    macro avg       0.50      0.50      0.50     40000
# weighted avg       0.50      0.50      0.50     40000



In [None]:
feature_imp = pd.DataFrame({
    'feature': X_train.columns,
    'importance': rfc_model.feature_importances_
}).sort_values(by='importance', ascending=False)

In [None]:
pd.set_option('display.max_rows', 70)
feature_imp

In [None]:
# keep the selected feature names in one place so train and test stay in sync
imp_features = ['monthly_charges', 'total_charges', 'tenure_months', 'age', 'minute', 'second', 'hour', 'date', 'week_of_year', 'day_of_week', 'month']
xtrain_imp_features = X_train[imp_features]
xtest_imp_features = X_test[imp_features]

In [None]:
rfc_model.fit(xtrain_imp_features, y_train)
ypred_rfc = rfc_model.predict(xtest_imp_features)
print(classification_report(y_test, ypred_rfc))

#### SVM

In [None]:
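from sklearn import svm
# note: SVC training time grows quickly with the number of rows; sklearn.svm.LinearSVC is a faster option for a linear kernel
svm_model = svm.SVC(kernel='linear')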
svm_model.fit(X_train, y_train)
ypred_svm = svm_model.predict(X_test)
print(classification_report(y_test, ypred_svm))

#### Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(X_train, y_train)
ypred_nb = nb.predict(X_test)
print(classification_report(y_test, ypred_nb))