<img align="right" style="padding-left:50px;" src="figures_wk4/data_cleaning.png" width=350><br>
## <b><u>User Bias in Data Cleaning</u></b>
For your homework assignment this week, we will explore how our treatment of our data can impact the quality of our results.

**Dataset:**
The data is a Salary Survey from AskAManager.org. It’s US-centric-ish but does allow for a range of country inputs.

 

In [1]:
import pandas as pd

In [2]:
df= pd.read_csv('telco.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 50 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Customer ID                        7043 non-null   object 
 1   Gender                             7043 non-null   object 
 2   Age                                7043 non-null   int64  
 3   Under 30                           7043 non-null   object 
 4   Senior Citizen                     7043 non-null   object 
 5   Married                            7043 non-null   object 
 6   Dependents                         7043 non-null   object 
 7   Number of Dependents               7043 non-null   int64  
 8   Country                            7043 non-null   object 
 9   State                              7043 non-null   object 
 10  City                               7043 non-null   object 
 11  Zip Code                           7043 non-null   int64

In [4]:
df.head()

Unnamed: 0,Customer ID,Gender,Age,Under 30,Senior Citizen,Married,Dependents,Number of Dependents,Country,State,...,Total Extra Data Charges,Total Long Distance Charges,Total Revenue,Satisfaction Score,Customer Status,Churn Label,Churn Score,CLTV,Churn Category,Churn Reason
0,8779-QRDMV,Male,78,No,Yes,No,No,0,United States,California,...,20,0.0,59.65,3,Churned,Yes,91,5433,Competitor,Competitor offered more data
1,7495-OOKFY,Female,74,No,Yes,Yes,Yes,1,United States,California,...,0,390.8,1024.1,3,Churned,Yes,69,5302,Competitor,Competitor made better offer
2,1658-BYGOY,Male,71,No,Yes,No,Yes,3,United States,California,...,0,203.94,1910.88,2,Churned,Yes,81,3179,Competitor,Competitor made better offer
3,4598-XLKNJ,Female,78,No,Yes,Yes,Yes,1,United States,California,...,0,494.0,2995.07,2,Churned,Yes,88,5337,Dissatisfaction,Limited range of services
4,4846-WHAFZ,Female,80,No,Yes,Yes,Yes,1,United States,California,...,0,234.21,3102.36,2,Churned,Yes,67,2793,Price,Extra data charges


### Assignment
Your goal for this assignment is to observe how your data treatment during the cleaning process can skew or bias the dataset.

Before diving right in, stop and read through the questions associated with the dataset. As you can see, they are either free-form text entries or categorical selections. Knowing this, perform some exploratory data analysis (EDA) to investigate the "state" of the dataset.

[Add as many code cells below here as needed]
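One possible starting point for the EDA is a per-column summary of data types, missing values, and distinct values. The sketch below is only an illustration: `summarize_state` is a hypothetical helper name, and the small `toy` frame is made-up data standing in for the real file.

```python
import pandas as pd


def summarize_state(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, missing share, distinct values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "pct_missing": (frame.isna().mean() * 100).round(1),
        "n_unique": frame.nunique(dropna=True),
    })


# Hypothetical toy data, not taken from the real dataset
toy = pd.DataFrame({
    "Customer Status": ["Churned", "Stayed", "Stayed", "Joined"],
    "Churn Reason": ["Price", None, None, None],
    "Monthly Charge": [70.5, 20.0, 95.1, 45.0],
})

summary = summarize_state(toy)
print(summary)
```

Calling `summarize_state(df)` on the loaded frame would give the same table for every column, which makes it easier to spot columns that are mostly empty or that have suspiciously many distinct labels.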


**Question:** How would you describe the "state" of this dataset? Be specific and detailed in your answer. (Think paragraphs rather than sentences).



### State of the Dataset: A Detailed Analysis
The Telco dataset presents customer-related information, likely aimed at analyzing churn behavior. Upon initial inspection, the dataset contains a mix of categorical and numerical variables, with several potential data quality issues that require attention. These include missing values, inconsistent formatting, outliers, and data type mismatches.

1. Column Naming and Structure
Before cleaning, the dataset contained column names with inconsistent capitalization and spaces, which could lead to errors during data manipulation. This issue was resolved by standardizing all column names—converting them to lowercase and replacing spaces with underscores—ensuring uniformity and ease of access.

2. Missing Values
The dataset had missing values, particularly in columns like churn_category and churn_reason. These were replaced with "not_applicable" to maintain data integrity while preventing errors in analysis. Other missing values, if present, would need further investigation to determine whether imputation or removal is necessary.
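A blanket fill treats every empty cell as "not applicable", including cells that may be genuinely missing for customers who did churn. A more conservative variant, sketched here on made-up data and assuming a lowercase `churn_label` column with `yes`/`no` values, fills only the rows where the value cannot exist:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the last row churned but has no recorded category
toy = pd.DataFrame({
    "churn_label": ["yes", "no", "no", "yes"],
    "churn_category": ["price", np.nan, np.nan, np.nan],
})

# Fill only where the value is structurally absent (the customer did not churn)
not_churned = toy["churn_label"].eq("no")
filled = toy.copy()
filled.loc[not_churned & filled["churn_category"].isna(), "churn_category"] = "not_applicable"

# Any remaining NaN belongs to a churned customer and is genuinely missing
still_missing = int(filled["churn_category"].isna().sum())
print(filled)
print("genuinely missing:", still_missing)
```

Keeping the two cases apart means the later analysis can still tell "no reason exists" from "a reason exists but was not recorded".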

3. Categorical Variables Standardization
Several categorical variables exhibited inconsistencies in text formatting, such as extra spaces and mixed capitalization. These were cleaned by converting all values to lowercase and stripping unnecessary spaces. Additionally, for customers without internet service, certain dependent columns (online_security, online_backup, etc.) contained values that were not applicable. These inconsistencies were corrected by explicitly setting them to "not_applicable" wherever necessary.
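To check whether text standardization actually merges duplicate labels, the number of distinct values can be compared before and after. This is a minimal sketch on a made-up series; the column name `paperless_billing` is used only as an example. It relies on the `.str` accessor, which leaves missing values missing, whereas converting with `astype(str)` first can, depending on the pandas version, turn them into the literal text `nan`.

```python
import numpy as np
import pandas as pd

# Hypothetical raw values with stray spaces, mixed case, and one missing entry
raw = pd.Series([" Yes", "yes ", "YES", "No", np.nan], name="paperless_billing")

# .str methods skip missing values, so the NaN stays a NaN
clean = raw.str.strip().str.lower()

labels_before = raw.nunique()
labels_after = clean.nunique()
print(labels_before, "->", labels_after)
print(clean.value_counts(dropna=False))
```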

4. Outlier Treatment
The dataset contained numerical variables such as monthly_charge and total_charges, where extreme values could distort analysis. To mitigate this, rows with values above the 99th percentile were removed, so that highly unusual values do not disproportionately impact statistical analyses. Note that this drops entire customer records rather than capping the values.
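A 99th-percentile threshold can be applied in two different ways, and the choice affects how much data survives. The sketch below contrasts dropping rows with capping values on a made-up `monthly_charge` column; the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical charges with one extreme value
charges = pd.DataFrame({"monthly_charge": [20.0, 35.0, 50.0, 65.0, 80.0, 500.0]})
cap = charges["monthly_charge"].quantile(0.99)

# Option A: drop rows above the threshold (loses the whole customer record)
dropped = charges[charges["monthly_charge"] <= cap]

# Option B: cap values at the threshold (keeps every row, alters the value)
capped = charges.assign(monthly_charge=charges["monthly_charge"].clip(upper=cap))

print("rows:", len(charges), "dropped:", len(dropped), "capped:", len(capped))
```

Dropping removes every other column for the affected customers as well, while capping keeps the record but replaces the recorded charge with the threshold. Either choice changes the distribution, so it is worth stating which one was used.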

5. Data Type Corrections
Certain numerical fields, such as total_charges, were stored as strings instead of numeric values, likely due to data entry errors or formatting inconsistencies. These were converted to numeric data types, allowing for proper calculations and aggregations.
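When a text column is converted with `errors='coerce'`, anything unparseable silently becomes `NaN`. A short check like the following, run on made-up values, shows how many entries were lost that way so the conversion does not hide a new source of missing data:

```python
import pandas as pd

# Hypothetical string-typed column with a blank and a placeholder entry
raw = pd.Series(["29.85", " ", "1889.5", "n/a"], name="total_charges")
converted = pd.to_numeric(raw, errors="coerce")

# Entries that were present as text but became NaN after coercion
newly_missing = raw[converted.isna() & raw.notna()]
print(converted.dtype, "| entries lost to coercion:", len(newly_missing))
print(newly_missing)
```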

6. Overall Data Integrity and Readiness
After these cleaning steps, the dataset is in a much improved and structured state. It is now more consistent, with well-formatted categorical variables, properly handled missing values, and optimized numerical data. However, further exploratory data analysis (EDA) may be needed to identify deeper patterns or subtle inconsistencies.

In its current state, the dataset is well-prepared for descriptive analysis, predictive modeling, and visualization, making it a valuable resource for understanding customer churn patterns in the telecommunications industry.

#### The Plan

Now, it is time to plan how you will clean up the dataset. You **are not** allowed to use any machine learning technique to clean the data. (No SMOTE! No machine learning! Or anything like that!)

**Question:** Based on your EDA above, detail how you would clean up this dataset. 
Things to consider: (This is not an exhaustive list)
- Are there columns that can't be effectively cleaned? If so, why?
- Are there columns that genuinely won't have a data value?
- Does it make sense to segment the dataset based on specific columns when determining how to handle the missing values?
- Are outliers a factor in this dataset?

Remember, preserving as much of the data as possible is the goal. That means dropping rows with a missing value somewhere might not be the best idea.

[Add your answer to this markdown cell]

**Answer:**
To clean the dataset effectively, the following steps will be taken:
    
1. **Handling Missing Values:**
   - 'Churn Category' and 'Churn Reason' are missing for non-churned customers. These can be labeled as 'Not Applicable' instead of being treated as missing.
   - Verify if any other columns contain incorrect or empty values.

2. **Standardizing Categorical Variables:**
   - Convert categorical columns to a consistent format (e.g., 'Yes'/'No' instead of mixed cases).
   - Encode categorical variables for analysis where necessary.

3. **Outlier Treatment:**
   - Examine numerical variables for extreme values (e.g., unusually high total charges or long-distance charges).
   - Decide if outliers should be removed or adjusted.

4. **Data Type Corrections:**
   - Convert 'Total Charges' to a numeric type if it is stored as strings.
   - Ensure categorical data is stored as category type for efficiency.

5. **Checking Consistency:**
   - Verify if dependent variables align logically (e.g., 'Internet Service' customers should have 'Internet Type' filled in).
    
These steps will help ensure data quality while preserving valuable information.
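Before overwriting anything in the consistency step, it can help to count how many rows actually conflict. The following sketch uses made-up data and assumes lowercase `internet_service` and `streaming_tv` columns with `yes`/`no` values:

```python
import pandas as pd

# Hypothetical toy data: the third row has streaming without internet service
toy = pd.DataFrame({
    "internet_service": ["yes", "no", "no", "yes"],
    "streaming_tv": ["yes", "no", "yes", "no"],
})

# Rows that claim a streaming add-on without having internet service
conflicts = toy[toy["internet_service"].eq("no") & toy["streaming_tv"].eq("yes")]

# Cross-tabulation of the two columns gives the same information at a glance
table = pd.crosstab(toy["internet_service"], toy["streaming_tv"])
print(table)
print("conflicting rows:", len(conflicts))
```

If the conflict count is zero, the overwrite only relabels `no` as `not_applicable`, which is a modeling decision rather than an error correction.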

#### Implementation

In [7]:
# Verify column names
print("Original Column Names:", df.columns)

# 1. Standardizing Column Names (strips surrounding whitespace, lowercases, replaces spaces with underscores)
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

print("Updated Column Names:", df.columns)  # Check updated names

# 2. Handling Missing Values
df = df.copy()  # Avoid chained assignment issues
df = df.fillna({'churn_category': 'not_applicable', 'churn_reason': 'not_applicable'})

# 3. Standardizing Categorical Variables
categorical_columns = [col for col in df.columns if df[col].dtype == 'object']

for col in categorical_columns:
    # .str methods keep missing values as NaN; astype(str) could turn them into the text 'nan'
    df[col] = df[col].str.strip().str.lower()

# 5. Data Type Corrections (run before step 4 so the percentile comparison is numeric)
if 'total_charges' in df.columns:
    df['total_charges'] = pd.to_numeric(df['total_charges'], errors='coerce')

# 4. Outlier Treatment (dropping rows above the 99th percentile of each charge column)
# After renaming, 'Monthly Charge' becomes 'monthly_charge' (singular)
for col in ['monthly_charge', 'total_charges']:
    if col in df.columns:
        # Keep rows with a missing value so the filter does not silently discard them
        df = df[df[col].isna() | (df[col] <= df[col].quantile(0.99))].copy()

# 6. Checking Consistency
if 'internet_service' in df.columns:
    no_internet_mask = df['internet_service'] == 'no'
    # Names match the renamed columns, e.g. 'device_protection_plan' and 'premium_tech_support'
    for col in ['online_security', 'online_backup', 'device_protection_plan', 'premium_tech_support', 'streaming_tv', 'streaming_movies']:
        if col in df.columns:
            df.loc[no_internet_mask, col] = 'not_applicable'

# Display cleaned dataset info
df.info()
df.head()


Original Column Names: Index(['Customer ID', 'Gender', 'Age', 'Under 30', 'Senior Citizen', 'Married',
       'Dependents', 'Number of Dependents', 'Country', 'State', 'City',
       'Zip Code', 'Latitude', 'Longitude', 'Population', 'Quarter',
       'Referred a Friend', 'Number of Referrals', 'Tenure in Months', 'Offer',
       'Phone Service', 'Avg Monthly Long Distance Charges', 'Multiple Lines',
       'Internet Service', 'Internet Type', 'Avg Monthly GB Download',
       'Online Security', 'Online Backup', 'Device Protection Plan',
       'Premium Tech Support', 'Streaming TV', 'Streaming Movies',
       'Streaming Music', 'Unlimited Data', 'Contract', 'Paperless Billing',
       'Payment Method', 'Monthly Charge', 'Total Charges', 'Total Refunds',
       'Total Extra Data Charges', 'Total Long Distance Charges',
       'Total Revenue', 'Satisfaction Score', 'Customer Status', 'Churn Label',
       'Churn Score', 'CLTV', 'Churn Category', 'Churn Reason'],
      dtype='object')
Up

Unnamed: 0,customer_id,gender,age,under_30,senior_citizen,married,dependents,number_of_dependents,country,state,...,total_extra_data_charges,total_long_distance_charges,total_revenue,satisfaction_score,customer_status,churn_label,churn_score,cltv,churn_category,churn_reason
0,8779-qrdmv,male,78,no,yes,no,no,0,united states,california,...,20,0.0,59.65,3,churned,yes,91,5433,competitor,competitor offered more data
1,7495-ookfy,female,74,no,yes,yes,yes,1,united states,california,...,0,390.8,1024.1,3,churned,yes,69,5302,competitor,competitor made better offer
2,1658-bygoy,male,71,no,yes,no,yes,3,united states,california,...,0,203.94,1910.88,2,churned,yes,81,3179,competitor,competitor made better offer
3,4598-xlknj,female,78,no,yes,yes,yes,1,united states,california,...,0,494.0,2995.07,2,churned,yes,88,5337,dissatisfaction,limited range of services
4,4846-whafz,female,80,no,yes,yes,yes,1,united states,california,...,0,234.21,3102.36,2,churned,yes,67,2793,price,extra data charges


Based on the plan you described above, go ahead and clean up the dataset.

[Add as many code cells below here as needed]

#### Reflection
Write a short reflection (400-500 words) answering the following: 
- What were the biggest issues you encountered in the messy dataset?
- How did cleaning the dataset improve its usability for machine learning?
- What would happen if we trained a model on the messy dataset vs. the cleaned one?
- Do you feel you skewed or biased the dataset while cleaning it?

**Answer:**

**Reflection on Data Cleaning**

1. **Biggest Issues Encountered:**  
   - The dataset had missing values in 'Churn Category' and 'Churn Reason', which were only applicable to churned customers.
   - Some categorical columns contained inconsistencies in formatting (e.g., mixed cases, extra spaces).
   - Potential outliers in 'Monthly Charges' and 'Total Charges' that needed filtering.
   - Some categorical values were inconsistent with other fields (e.g., customers with 'No Internet Service' having entries in streaming service columns).

2. **Improvements in Usability for Machine Learning:**  
   - Missing values were properly handled to prevent bias or loss of crucial information.
   - Categorical variables were standardized, making them easier to encode.
   - Outliers were filtered to prevent extreme values from skewing the model.
   - Ensured logical consistency, which is crucial for reliable model predictions.

3. **Training on Messy vs. Cleaned Dataset:**  
   - Training on a messy dataset would introduce inconsistencies, making it harder for a model to learn meaningful patterns.
   - Incorrect data relationships (e.g., 'No Internet Service' with 'Streaming TV = Yes') could lead to misleading insights.
   - Cleaning ensures that the model generalizes better to real-world scenarios.

4. **Potential Bias Introduced:**  
   - By filling missing 'Churn Category' and 'Churn Reason' with 'Not Applicable', there is a risk of over-representing non-churned customers.
   - Removing outliers could eliminate valid but extreme customer behaviors.
   - Standardizing categories could mask subtle variations in customer data.

Overall, the cleaning process significantly improved dataset quality while balancing the risk of introducing bias.
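One way to put a number on the bias question is to compare a key statistic before and after a cleaning step. The sketch below uses made-up data and a hypothetical 90th-percentile filter (chosen only so the effect is visible on six rows) to show how dropping high-charge rows can shift the churn rate; `churn_rate` is an illustrative helper, not part of the assignment.

```python
import pandas as pd

# Hypothetical toy data in which the highest charges belong to churned customers
before = pd.DataFrame({
    "churn_label": ["yes", "no", "no", "no", "yes", "no"],
    "monthly_charge": [110.0, 20.0, 45.0, 60.0, 118.0, 30.0],
})

# Hypothetical cleaning step: drop rows above the 90th percentile of the charge
threshold = before["monthly_charge"].quantile(0.90)
after = before[before["monthly_charge"] <= threshold]


def churn_rate(frame: pd.DataFrame) -> float:
    """Share of rows labeled as churned."""
    return float(frame["churn_label"].eq("yes").mean())


shift = churn_rate(after) - churn_rate(before)
print(f"before: {churn_rate(before):.3f}  after: {churn_rate(after):.3f}  shift: {shift:+.3f}")
```

If the removed rows are disproportionately churned customers, as in this toy example, the cleaned data understates churn. Running the same comparison on the real frame would show whether the outlier filter had that effect.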

## Deliverables
Upload your Jupyter Notebook to your GitHub repo and then provide a link to that repo in WorldClass.