# Assessing Customer Churn Using Machine Learning

![Cartoon of telecom customers](IMG_8811.png)


The telecommunications (telecom) sector in India is changing rapidly, with new telecom businesses being created and many customers switching between providers. "Churn" refers to customers or subscribers ceasing to use a company's services or products. Understanding the factors that influence customer retention, and using them to predict churn, is crucial for telecom companies that want to improve service quality and customer satisfaction. As the data scientist on this project, you will explore customer behavior and demographics in the Indian telecom sector to predict customer churn, using two datasets covering four major telecom partners: Airtel, Reliance Jio, Vodafone, and BSNL:

- `telecom_demographics.csv` contains information related to Indian customer demographics:

| Variable             | Description                                                  |
|----------------------|--------------------------------------------------------------|
| `customer_id`        | Unique identifier for each customer.                         |
| `telecom_partner`    | The telecom partner associated with the customer.            |
| `gender`             | The gender of the customer.                                  |
| `age`                | The age of the customer.                                     |
| `state`              | The Indian state in which the customer is located.           |
| `city`               | The city in which the customer is located.                   |
| `pincode`            | The pincode of the customer's location.                      |
| `registration_event` | When the customer registered with the telecom partner.       |
| `num_dependents`     | The number of dependents (e.g., children) the customer has.  |
| `estimated_salary`   | The customer's estimated salary.                             |

- `telecom_usage.csv` contains information about the usage patterns of Indian customers:

| Variable   | Description                                                  |
|------------|--------------------------------------------------------------|
| `customer_id` | Unique identifier for each customer.                         |
| `calls_made` | The number of calls made by the customer.                    |
| `sms_sent`   | The number of SMS messages sent by the customer.             |
| `data_used`  | The amount of data used by the customer.                     |
| `churn`    | Binary variable indicating whether the customer has churned or not (1 = churned, 0 = not churned).|


### Project Requirements

Does Logistic Regression or Random Forest produce a higher accuracy score in predicting telecom churn in India?

- Load the two CSV files into separate DataFrames and merge them into a DataFrame named `churn_df`. Calculate and print the churn rate, and identify the categorical variables in `churn_df`.
- Convert the categorical features in `churn_df` into numerical form. Separate the features from the target, scale the features, and store the result in `features_scaled`. Define the target variable for the churn prediction model.
- Split the processed data into training and testing sets named `X_train`, `X_test`, `y_train`, and `y_test`, using an 80-20 split and a random state of 42 for reproducibility.
- Train Logistic Regression and Random Forest Classifier models, each with a random seed of 42. Store the model predictions in `logreg_pred` and `rf_pred`.
- Assess the models on the test data. Assign the name of the model with the higher accuracy (`"LogisticRegression"` or `"RandomForest"`) to `higher_accuracy`.

### Importing the required libraries

In [5]:
# Import libraries and methods/functions
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder 
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Start your code here!

### How to approach this project?

1. Loading and exploring data

2. Processing the joined data

3. Splitting the data

4. Getting the predictions

5. Assessing the models

6. Identifying which model has higher accuracy

### 1. Loading and exploring data

You should load the telecom_demographics.csv and telecom_usage.csv datasets, join them, calculate the proportion of churn, and identify categorical variables.
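
As a rough sketch of these steps on made-up data: the two small DataFrames below are hypothetical stand-ins for the CSV files, and the names `demog`, `usage`, and `toy_df` are illustrative only. For a 0/1 target, the churn rate is simply the mean of the `churn` column, and `select_dtypes` is one way to list the non-numeric columns.

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files (made-up values)
demog = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "telecom_partner": ["Airtel", "BSNL", "Vodafone", "Airtel"],
    "age": [25, 41, 33, 58],
})
usage = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "calls_made": [10, 3, 25, 7],
    "churn": [0, 1, 0, 0],
})

# Join on the shared key; validate= raises if customer_id is duplicated
toy_df = demog.merge(usage, on="customer_id", validate="one_to_one")

# For a 0/1 target, the mean equals the proportion of churned customers
toy_churn_rate = toy_df["churn"].mean()

# Columns that are not numeric are candidates for encoding
toy_categorical = toy_df.select_dtypes(exclude="number").columns.tolist()

print(toy_churn_rate)
print(toy_categorical)
```

The `validate="one_to_one"` argument is optional; it is included here only as a guard against duplicated customer IDs silently inflating the joined table.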

In [None]:
# Load data
telco_demog = pd.read_csv('telecom_demographics.csv')
telco_usage = pd.read_csv('telecom_usage.csv')

# Join data
churn_df = telco_demog.merge(telco_usage, on='customer_id')

# Identify churn rate
churn_rate = churn_df['churn'].value_counts() / len(churn_df)
print(churn_rate)

# Identify categorical variables
print(churn_df.info())


### 2. Processing the joined data

You should convert categorical variables into numerical representations, standardize all relevant features, and then define your target variable.


In [None]:
# One Hot Encoding for categorical variables
churn_df = pd.get_dummies(churn_df, columns=['telecom_partner', 'gender', 'state', 'city', 'registration_event'])

# Feature Scaling
scaler = StandardScaler()

# 'customer_id' is not a feature
features = churn_df.drop(['customer_id', 'churn'], axis=1)
features_scaled = scaler.fit_transform(features)

# Target variable
target = churn_df['churn']


### 3. Splitting the data

You should split the standardized features and the target variable into training and testing sets using an 80-20 split, with a random state of 42.

In [None]:

# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(features_scaled, target, test_size=0.2, random_state=42)
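
One caveat worth noting: in this notebook the scaler is fitted on all rows before the split, so statistics from the test rows influence the scaling. The sketch below shows a possible alternative, not the approach the project requires: split first, then fit the preprocessing on the training rows only. The toy data, the column lists, and the names `preprocess`, `X_tr`, and `X_te` are assumptions made for illustration; `stratify=y` is likewise an optional extra.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data for illustration only
toy = pd.DataFrame({
    "gender": ["M", "F", "F", "M", "F", "M", "M", "F", "F", "M"],
    "age": [25, 41, 33, 58, 47, 29, 36, 52, 61, 23],
    "calls_made": [10, 3, 25, 7, 14, 30, 2, 18, 9, 21],
    "churn": [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
})
X = toy.drop(columns="churn")
y = toy["churn"]

# Split first; stratify keeps the churn proportion similar in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Encode the text column and scale the numeric ones in a single step
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["gender"]),
    ("num", StandardScaler(), ["age", "calls_made"]),
])

# Fit on the training rows only, then reuse the fitted transform on the test rows
X_tr_scaled = preprocess.fit_transform(X_tr)
X_te_scaled = preprocess.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

With `handle_unknown="ignore"`, a category that appears only in the test rows is encoded as all zeros rather than raising an error.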


### 4. Getting the predictions

You should instantiate, fit, and produce predictions for each of the Logistic Regression and Random Forest models.

In [None]:
# Instantiate the Logistic Regression
logreg = LogisticRegression(random_state=42)
logreg.fit(X_train, y_train)

# Logistic Regression predictions
logreg_pred = logreg.predict(X_test)

# Instantiate the Random Forest model
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

# Random Forest predictions
rf_pred = rf.predict(X_test)

### 5. Assessing the models

You should print confusion matrices and classification reports for each of the Logistic Regression and Random Forest models.

In [None]:
# Logistic Regression evaluation
print(confusion_matrix(y_test, logreg_pred))
print(classification_report(y_test, logreg_pred))

# Random Forest evaluation
print(confusion_matrix(y_test, rf_pred))
print(classification_report(y_test, rf_pred))


### 6. Identifying which model has higher accuracy

You should look at the classification reports to see which accuracy score is closer to 1.




In [None]:
# Which accuracy score is higher? LogisticRegression or RandomForest
higher_accuracy = "RandomForest"
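
Rather than reading the accuracy off the printed reports, the comparison could also be made in code with `accuracy_score`. The sketch below uses hypothetical labels and predictions (`y_true`, `pred_a`, and `pred_b` are made-up stand-ins for `y_test`, `logreg_pred`, and `rf_pred`), so its outcome says nothing about which model wins on the real data.

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels and predictions, for illustration only
y_true = [0, 1, 0, 0, 1, 0, 1, 0]
pred_a = [0, 0, 0, 0, 1, 0, 0, 0]  # stands in for logreg_pred
pred_b = [0, 1, 0, 1, 1, 0, 1, 0]  # stands in for rf_pred

# Fraction of predictions that match the true labels
acc_a = accuracy_score(y_true, pred_a)
acc_b = accuracy_score(y_true, pred_b)

# Ties go to the second name here; that is an arbitrary choice
winner = "LogisticRegression" if acc_a > acc_b else "RandomForest"
print(acc_a, acc_b, winner)
```

In the notebook itself, the same pattern would take `y_test`, `logreg_pred`, and `rf_pred` as inputs and assign the result to `higher_accuracy`.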