# Customer Churn Prediction in a Telecommunications Company

## Introduction

The primary objective of this project was to develop a predictive model to identify customers at risk of churning for a telecommunications company. By predicting churn, the company can take proactive measures to retain these customers, thereby improving its revenue and customer satisfaction.

## Exploratory Data Analysis

Initially, the dataset was examined. It included the following groups of variables:


1.   Customer Demographics: `gender`, `SeniorCitizen`, `Partner`, and `Dependents`.
2.   Service Usage: variables related to phone and internet services, such as `PhoneService`, `MultipleLines`, `InternetService`, `OnlineSecurity`, `OnlineBackup`, `DeviceProtection`, `TechSupport`, `StreamingTV`, and `StreamingMovies`.
3.   Account Information: variables such as `Contract`, `PaperlessBilling`, `PaymentMethod`, `MonthlyCharges`, and `TotalCharges`.
4.   `Churn`: the class label (target variable), which indicates whether the customer churned.



### Key Observations from EDA



*   The `TotalCharges` column contained some non-numeric values, which were treated as missing values so that the column could be converted to a numeric type.
*   Some categorical variables were significantly imbalanced, notably `Churn`, where non-churned customers were more prevalent.
*   The distribution of `TotalCharges` was right-skewed: most customers had lower total charges, and fewer customers had higher ones.
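
The checks behind these observations can be sketched roughly as follows. This is a minimal illustration on a tiny made-up table, not the project's actual code; the values in the DataFrame are invented, and only the column names `TotalCharges` and `Churn` come from the dataset described above.

```python
import pandas as pd

# Tiny synthetic stand-in for the real dataset (all values are made up).
df = pd.DataFrame({
    "TotalCharges": ["29.85", "1889.5", " ", "108.15", "7895.15", "151.65"],
    "Churn": ["No", "No", "No", "Yes", "No", "Yes"],
})

# Entries that cannot be parsed as numbers (here a blank string) become NaN.
total_charges = pd.to_numeric(df["TotalCharges"], errors="coerce")
n_non_numeric = int(total_charges.isna().sum())

# Share of each class in the target variable.
churn_share = df["Churn"].value_counts(normalize=True)

# A positive sample skewness indicates a long right tail.
skewness = total_charges.skew()

print("non-numeric entries:", n_non_numeric)
print(churn_share)
print("skewness:", skewness)
```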



## Data Preprocessing



*   Handling Missing Values: Missing values in `TotalCharges` were imputed with the column median.
*   Encoding Categorical Variables: Binary categorical variables were encoded as 0/1 values, while multi-category variables were one-hot encoded.
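
A minimal sketch of these two steps with pandas is shown below. The rows are invented for illustration, and the `Yes`/`No` to 1/0 mapping is an assumption about how the binary columns were encoded; the project's own code may differ.

```python
import pandas as pd

# Small made-up sample using column names from the dataset.
df = pd.DataFrame({
    "Partner": ["Yes", "No", "No", "Yes"],
    "Contract": ["Month-to-month", "One year", "Two year", "Month-to-month"],
    "TotalCharges": ["29.85", " ", "108.15", "1889.50"],
    "Churn": ["No", "No", "Yes", "No"],
})

# Convert to numeric (unparseable entries become NaN), then impute the median.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
df["TotalCharges"] = df["TotalCharges"].fillna(df["TotalCharges"].median())

# Binary columns: map to 0/1 (assumed mapping).
for col in ["Partner", "Churn"]:
    df[col] = df[col].map({"No": 0, "Yes": 1})

# Multi-category columns: one-hot encode.
df = pd.get_dummies(df, columns=["Contract"], dtype=int)

print(df)
```

The median is used rather than the mean because it is less sensitive to the long right tail of `TotalCharges`.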





## Feature Engineering

**Correlation Analysis:** A correlation matrix was used to identify features whose correlation with the target variable `Churn` was higher than 0.1 or lower than -0.1, that is, whose absolute correlation exceeded 0.1. This step reduced dimensionality and kept the focus on the most relevant features.
The final features selected included `tenure`, `MonthlyCharges`, `TotalCharges`, `Contract` (one-hot encoded), and other relevant service-related features.
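
The thresholding rule can be sketched as below. The helper `select_by_correlation` and the `flat` column are hypothetical names introduced for this illustration, and the data is made up; only the 0.1 threshold and the `Churn` target come from the description above.

```python
import pandas as pd

# Made-up numeric data; `flat` is constructed to be uncorrelated with Churn.
df = pd.DataFrame({
    "tenure": [60, 48, 36, 50, 2, 5, 12, 8],
    "MonthlyCharges": [30, 55, 40, 60, 80, 95, 70, 85],
    "flat": [1, 2, 1, 2, 1, 2, 1, 2],
    "Churn": [0, 0, 0, 0, 1, 1, 1, 1],
})

def select_by_correlation(frame, target="Churn", threshold=0.1):
    """Return features whose absolute correlation with `target` exceeds `threshold`."""
    corr = frame.corr()[target].drop(target)
    return corr[corr.abs() > threshold].index.tolist()

selected = select_by_correlation(df)
print(selected)
```

Pearson correlation only captures linear relationships, so a filter like this is a coarse first pass rather than a complete feature-selection method.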

## Model Selection and Training

Several machine learning algorithms were considered and trained:


1.   Logistic Regression
2.   Random Forest
3.   Gradient Boosting
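
As a rough sketch of how such a comparison might be set up with scikit-learn, the example below trains the three model types on synthetic data. The data from `make_classification`, the 80/20 split, the class weights, and the default model settings are all assumptions made for illustration, not the project's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced data standing in for the churn dataset.
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.73, 0.27], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

# Fit each model and record its accuracy on the held-out split.
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))

for name, acc in accuracies.items():
    print(f"{name}: {acc:.2f}")
```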



### Visualization of Model Performance
A comparison of the accuracy of different models was visualized using a bar plot.

For this project, Gradient Boosting was selected for further hyperparameter tuning because it performed best in the initial accuracy comparison.

### Hyperparameter Tuning
Hyperparameter tuning for the Gradient Boosting model was conducted using Randomized Search to maximize the model's performance.

The search explored a wide range of hyperparameters, including `n_estimators`, `learning_rate`, `max_depth`, `min_samples_split`, `min_samples_leaf`, and `subsample`.
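
A randomized search over these hyperparameters could look roughly like the sketch below, using scikit-learn's `RandomizedSearchCV`. The sampling ranges, `n_iter=5`, `cv=3`, and the synthetic data are assumptions chosen to keep the example small; the ranges used in the project are not stated here.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for the preprocessed churn features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Assumed search space; uniform(loc, scale) samples from [loc, loc + scale].
param_distributions = {
    "n_estimators": randint(50, 200),
    "learning_rate": uniform(0.01, 0.29),
    "max_depth": randint(2, 6),
    "min_samples_split": randint(2, 11),
    "min_samples_leaf": randint(1, 6),
    "subsample": uniform(0.6, 0.4),
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,            # number of sampled configurations (kept small here)
    cv=3,                # 3-fold cross-validation per configuration
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)

best_params = search.best_params_
print(best_params)
print("best CV accuracy:", search.best_score_)
```

Unlike an exhaustive grid, randomized search evaluates only `n_iter` sampled configurations, which keeps the cost bounded when the search space is large.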

## Evaluation Results

Each model was evaluated using accuracy.

The results for Gradient Boosting were:
*   Accuracy: 0.77
*   Precision: 0.55
*   Recall: 0.77
*   F1-Score: 0.64

These values are rounded to two decimal places.

The confusion matrix for the tuned Gradient Boosting model broke the predictions down into true positives, true negatives, false positives, and false negatives, showing how the model's errors were split between missed churners and false alarms.
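
To make the link between the confusion matrix and the reported metrics concrete, the sketch below derives precision, recall, and F1 from the four counts. The labels `y_true` and `y_pred` are invented toy values, not the project's predictions; the last line only checks that the reported F1-score is consistent with the reported precision and recall.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Toy labels (1 = churned, 0 = not churned); values are made up.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, y_pred).ravel())

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted churners who really churned
recall = tp / (tp + fn)      # share of actual churners that were caught
f1 = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)
print(accuracy, precision, recall, f1)

# Consistency check on the reported values: precision 0.55, recall 0.77.
reported_f1 = 2 * 0.55 * 0.77 / (0.55 + 0.77)
print(round(reported_f1, 2))
```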

## Challenges Faced
*   **Imbalanced Dataset**: The `Churn` variable was imbalanced, with more non-churned customers. Techniques such as SMOTE (Synthetic Minority Over-sampling Technique) were considered to address this imbalance.
*   **Feature Selection**: Identifying the most relevant features without overfitting was crucial. This was managed through correlation analysis and iterative experimentation with different subsets of features.
*   **Hyperparameter Tuning**: Finding the optimal hyperparameters for models, particularly Gradient Boosting, required extensive search and cross-validation, which was computationally intensive.
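
For readers unfamiliar with SMOTE, the core idea is to create synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below is a simplified illustration of that idea, not the reference implementation from a library such as imbalanced-learn; the function name `smote_sketch` and the toy points are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority samples by interpolating towards neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    k = min(k, len(X_minority) - 1)

    # Ask for k + 1 neighbours: for distinct points, the first is the point itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    neighbour_idx = nn.kneighbors(X_minority, return_distance=False)[:, 1:]

    # Pick a random base point and one of its k neighbours for each new sample.
    base = rng.integers(0, len(X_minority), size=n_synthetic)
    chosen = neighbour_idx[base, rng.integers(0, k, size=n_synthetic)]

    # Place the new sample at a random position on the segment between them.
    gap = rng.random((n_synthetic, 1))
    return X_minority[base] + gap * (X_minority[chosen] - X_minority[base])

# Toy minority-class points (made up).
X_min = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [2.5, 2.5]])
synthetic = smote_sketch(X_min, n_synthetic=6, k=2)
print(synthetic)
```

Oversampling of this kind is normally applied to the training split only, so that the test set still reflects the original class distribution.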

## Conclusion

The project successfully developed a predictive model for customer churn, with Gradient Boosting achieving the highest accuracy. The insights gained from EDA and feature engineering were crucial in building an effective model. Despite challenges such as imbalanced data and feature selection, the project produced predictions that can help the telecommunications company retain customers and reduce churn.

Future work could explore additional ensemble methods and deep learning, as well as strategies to handle class imbalance more effectively. Additionally, integrating customer feedback and service interaction data might further enhance the model's predictive power.











