# Week 1 Assignment: Predicting Customer Churn with Logistic Regression

---

### **Objective**

The goal of this assignment is to build and evaluate a Logistic Regression model to predict customer churn for a telecommunications company. This task will take you through the fundamental steps of a real-world machine learning project: data exploration, preprocessing, model training, and performance evaluation.

### **Background & Problem Statement**

You are working as a Junior Data Scientist for a telecom company, "ConnectSphere." The company is facing a significant challenge with customer churn—customers who cancel their subscriptions. It is far more expensive to acquire a new customer than it is to retain an existing one.

Your manager has tasked you with analyzing a dataset of past customers to identify the key factors that lead to churn. Ultimately, you need to build a model that can predict whether a current customer is likely to churn. This will allow the marketing team to proactively offer retention incentives to at-risk customers.

### **Dataset**

You will be using the provided "Telco Customer Churn" dataset. It contains information about customer demographics, subscribed services, account information, and whether they churned.

#### **Key Columns to Note:**
*   `customerID`: Unique identifier for each customer.
*   `gender`, `SeniorCitizen`, `Partner`, `Dependents`: Customer demographic information.
*   `tenure`: Number of months the customer has stayed with the company.
*   `PhoneService`, `MultipleLines`, `InternetService`, etc.: Services subscribed to by the customer.
*   `MonthlyCharges`, `TotalCharges`: Account and payment information.
*   **`Churn`**: The target variable. 'Yes' if the customer churned, 'No' otherwise.

---

### **Tasks & Instructions**

Please structure your code (either in a Jupyter Notebook or a Python script) to follow these steps. Add comments or markdown cells to explain your process and interpret your results.

**1. Step 1: Setup and Data Loading**
   - Import necessary libraries (`pandas`, `numpy`, `sklearn`, `matplotlib`/`seaborn`).
   - Load the `Telco-Customer-Churn.csv` file into a pandas DataFrame.
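
As a starting point, loading could look like the minimal sketch below. It assumes the CSV sits in your working directory under the file name given above; `CSV_PATH` and `load_churn_data` are hypothetical names chosen for illustration, not part of the provided materials.

```python
import pandas as pd

# Assumption: the file is in the current working directory.
CSV_PATH = "Telco-Customer-Churn.csv"


def load_churn_data(path: str = CSV_PATH) -> pd.DataFrame:
    """Read the churn CSV into a DataFrame (hypothetical helper)."""
    return pd.read_csv(path)
```

Calling `df = load_churn_data()` followed by `df.head()` is then enough to confirm the file was read as expected.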

**2. Step 2: Exploratory Data Analysis (EDA) & Preprocessing**
   - Inspect the first few rows of the DataFrame using `.head()`.
   - Use `.info()` to check data types and look for missing values.
     - *Hint: The `TotalCharges` column might be an 'object' type instead of a number. You will need to investigate why and convert it to a numeric type. Any rows that can't be converted should be handled appropriately (e.g., by dropping them).*
   - Get summary statistics with `.describe()`.
   - Analyze the target variable `Churn`. Is the dataset balanced? (i.e., what's the proportion of 'Yes' vs. 'No'?)
   - Convert the categorical target variable `Churn` into a numerical format (e.g., 'Yes' -> 1, 'No' -> 0).
   - Identify all other categorical columns in the dataset. Convert them into numerical format using an appropriate encoding technique (e.g., one-hot encoding with `pandas.get_dummies`).
   - The `customerID` column is not a useful feature for prediction. Make sure to drop it before training.
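
The preprocessing steps above can be sketched on a few toy rows. The values below are invented for illustration and are not taken from the real dataset, and the blank `TotalCharges` entry is only an assumed example of a value that cannot be parsed. Using `drop_first=True` is one possible choice (it avoids a redundant dummy column), not a requirement.

```python
import pandas as pd

# Toy rows that mimic the dataset's layout; all values are made up.
df = pd.DataFrame({
    "customerID": ["0001-A", "0002-B", "0003-C", "0004-D"],
    "Contract": ["Month-to-month", "One year", "Two year", "Month-to-month"],
    "tenure": [1, 12, 0, 30],
    "MonthlyCharges": [29.85, 56.95, 20.00, 70.70],
    "TotalCharges": ["29.85", "683.4", " ", "2121.0"],  # " " cannot be parsed
    "Churn": ["Yes", "No", "No", "Yes"],
})

# Turn unparseable entries into NaN, then drop those rows.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
df = df.dropna(subset=["TotalCharges"]).reset_index(drop=True)

# Class balance of the target, expressed as proportions.
churn_share = df["Churn"].value_counts(normalize=True)

# Encode the target, drop the identifier, one-hot encode what is left.
df["Churn"] = df["Churn"].map({"Yes": 1, "No": 0})
df = df.drop(columns=["customerID"])
df = pd.get_dummies(df, drop_first=True)
```

On the real data, inspect the rows that become `NaN` before dropping them, so that you can explain in your notebook why they could not be converted.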

**3. Step 3: Feature Selection and Data Splitting**
   - Define your feature matrix `X` (all columns except the target) and your target vector `y` (the `Churn` column).
   - Split your data into a training set (80%) and a testing set (20%) using `train_test_split` from scikit-learn. Use a `random_state` for reproducibility.
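
A minimal sketch of the split, using a randomly generated stand-in for the preprocessed DataFrame. The `random_state` value of 42 is arbitrary, and `stratify=y` is an optional addition (not required by the task) that keeps the churn proportion the same in both sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the preprocessed data; random values for illustration only.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "tenure": rng.integers(0, 72, size=100),
    "MonthlyCharges": rng.uniform(20, 120, size=100),
    "Churn": np.repeat([0, 1], [70, 30]),  # 70 non-churners, 30 churners
})

X = df.drop(columns=["Churn"])  # every column except the target
y = df["Churn"]                 # the target vector

# 80/20 split; a fixed random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```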

**4. Step 4: Model Training**
   - Instantiate a `LogisticRegression` model from scikit-learn.
   - Train (fit) the model on your training data (`X_train`, `y_train`).
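
Training itself takes only two lines. The sketch below uses synthetic arrays in place of your real `X_train` and `y_train`. Setting `max_iter=1000` is an assumption: it gives the solver more iterations than the default, which can help when features are unscaled, but you may not need it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic training data; the label depends mostly on the first feature.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
y_train = (X_train[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Instantiate and fit the model.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
```

After fitting, `model.coef_` holds one coefficient per feature, which is a natural place to start when looking for the factors associated with churn.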

**5. Step 5: Model Evaluation**
   - Make predictions on your testing data (`X_test`).
   - Calculate the following evaluation metrics:
     1.  **Accuracy:** What percentage of predictions were correct?
     2.  **Confusion Matrix:** Display the matrix to see the breakdown of True Positives, True Negatives, False Positives, and False Negatives.
     3.  **Precision:** Of all the customers your model predicted would churn, what fraction actually did?
     4.  **Recall (Sensitivity):** Of all the customers who actually churned, what fraction did your model correctly identify?
   - **Write a brief interpretation for each metric.** In the context of this business problem, is precision or recall more important? Why?
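
To make the four metrics concrete, here is a small worked example on hand-written labels. These are illustrative values, not model output. With 2 true positives, 2 false negatives, 1 false positive, and 5 true negatives, accuracy is 7/10, precision is 2/3, and recall is 2/4.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hand-written labels: 1 = churned, 0 = stayed.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / all predictions
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)

# Rows are actual classes, columns are predicted classes, both ordered 0 then 1.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
```

For your own model, replace `y_true` with `y_test` and `y_pred` with the output of `model.predict(X_test)`.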

**6. Step 6: Conclusion**
   - Write a one-paragraph summary of your findings for your "manager." What does the model tell you, and how well does it perform at its task?

---

### **Submission Instructions**

1.  **Deadline:** You have **one week** from the assignment release date to submit your work.
2.  **Platform:** All submissions must be made to your allocated private GitLab repository. You **must** submit your work in a branch named `week_1`.
3.  **Format:** You can submit your work as either a Jupyter Notebook (`.ipynb`) or a Python script (`.py`).
4.  **Verification:** After pushing, verify that your branch and files are visible in the GitLab web interface. No further action is needed. The trainers will review all submissions on the `week_1` branch after the deadline. Assignments submitted after the deadline will not be reviewed, and this will be reflected in your course score.
5.  **Use of LLMs:** The use of LLMs is encouraged, but do not copy solutions blindly. Always review, test, and understand any generated code, and adapt it to the specific requirements of your assignment. Your submission should demonstrate your own comprehension, problem-solving process, and coding style, not the unedited output of an AI tool.