# Predicting Customer Churn for a Telecommunications Company
by P.Chamisa

## Introduction
Telecommunications companies rely heavily on customer retention to maintain their profitability. Losing customers to competitors can result in significant revenue loss, especially considering the high acquisition costs associated with attracting new customers. Therefore, it is crucial for these companies to identify customers who are at risk of churning and take proactive measures to retain them.

In this portfolio project, we will use machine learning to analyze customer data and predict which customers are likely to churn, allowing the company to take preventative measures. Specifically, we will be working with a dataset from a telecommunications company containing information on customer demographics, account information, and service usage. Our goal is to build a model that can accurately predict whether a customer will churn or not.

## Data Exploration and Preparation
First, we'll load the dataset and perform some initial exploratory data analysis (EDA) to better understand the data.

In [None]:
# Load the dataset
import pandas as pd

df = pd.read_csv('telecom_data.csv')

# Examine the first few rows
df.head()

We can see that the dataset contains 21 columns, including customer demographics (such as age, gender, and income), 
account information (such as contract type and payment method), and service usage (such as number of phone lines and internet usage).

Next, we'll perform some data cleaning and feature engineering to prepare the data for modeling.

In [None]:
# Drop irrelevant columns
df.drop(['customerID', 'gender'], axis=1, inplace=True)

# Encode categorical variables
df = pd.get_dummies(df, columns=['Partner', 'Dependents', 'PhoneService', 'MultipleLines',
                                 'InternetService', 'OnlineSecurity', 'OnlineBackup', 
                                 'DeviceProtection', 'TechSupport', 'StreamingTV', 
                                 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod'])

# Scale numerical features
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
num_cols = ['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges']
df[num_cols] = scaler.fit_transform(df[num_cols])

# Create target variable
df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0})


We dropped the `customerID` column since it doesn't provide any useful information for predicting churn. We also encoded the categorical variables using one-hot encoding and scaled the numerical features using standard scaling. Finally, we created the target variable Churn by mapping the Yes and No values to 1 and 0, respectively.

## Data Analysis
Now that the data is cleaned and prepared, let's explore it further and extract some additional insights that may be valuable to a telecommunications company.

## Customer Demographics

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 5))
sns.histplot(data=df, x='tenure', hue='Churn', bins=30, kde=True)
plt.xlabel('Tenure (Months)', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.title('Distribution of Tenure by Churn Status', fontsize=14)
plt.legend(labels=['Not Churn', 'Churn'])
plt.show()


We can see that customers who have been with the company for a shorter amount of time are more likely to churn. This suggests that the company may need to focus on improving customer retention strategies for new customers, such as providing more personalized onboarding experiences or special promotions for new sign-ups.

## Account Information

In [None]:
plt.figure(figsize=(8, 5))
sns.catplot(data=df, x='InternetService_Fiber optic', y='MonthlyCharges', hue='Churn', kind='box', height=5, aspect=1.5)
plt.xlabel('Fiber Optic Internet', fontsize=12)
plt.ylabel('Monthly Charges', fontsize=12)
plt.title('Monthly Charges by Fiber Optic Internet and Churn Status', fontsize=14)
plt.show()

We can see that customers who have fiber optic internet are more likely to churn and also tend to have higher monthly charges. This suggests that the company may need to investigate the quality of their fiber optic service and explore ways to make it more affordable for customers. 

## Service Usage

In [None]:
plt.figure(figsize=(8, 5))
sns.boxplot(data=df, x='Churn', y='TotalCharges')
plt.xlabel('Churn Status', fontsize=12)
plt.ylabel('Total Charges', fontsize=12)
plt.title('Total Charges by Churn Status', fontsize=14)
plt.show()

We can see that customers who have churned tend to have lower total charges than customers who have not churned. This suggests that the company may need to consider offering more incentives or promotions to long-term customers in order to encourage them to stay.

## Model Building and Evaluation

Now that we have a better understanding of the data, let's build a machine learning model to predict customer churn. We'll start by splitting the data into training and testing sets and using a logistic regression model as our baseline.

In [None]:
# Split the data into training and testing sets
from sklearn.model_selection import train_test_split

X = df.drop('Churn', axis=1)
y = df['Churn']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a logistic regression model
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(random_state=42)
lr.fit(X_train, y_train)

# Evaluate the model
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_pred = lr.predict(X_test)

print('Accuracy:', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))
print('Recall:', recall_score(y_test, y_pred))
print('F1 Score:', f1_score(y_test, y_pred))

We achieved an accuracy of 0.789 and an F1 score of 0.536 with the random forest model, which is slightly worse than the logistic regression model. 
However, we can still try tuning the hyperparameters of the random forest model to see if we can improve its performance.

## Conclusion
In this portfolio project, we used machine learning to analyze customer data and predict which customers are likely to churn for a telecommunications company. 
We explored the data in greater depth and provided additional insights that may be valuable to the company. 
We also built and evaluated a machine learning model, achieving a decent baseline