<img align="right" style="padding-left:50px;" src="figures_wk4/data_cleaning.png" width=350><br>
### User Bias in Data Cleaning
For your homework assignment this week, we will explore how our treatment of our data can impact the quality of our results.

**Dataset:**
The data is a Salary Survey from AskAManager.org. It’s US-centric-ish but does allow for a range of country inputs.

 

In [1]:
import pandas as pd

In [2]:
df= pd.read_csv('survey_data.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28108 entries, 0 to 28107
Data columns (total 18 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   timestamp  28108 non-null  object 
 1   q1         28108 non-null  object 
 2   q2         28033 non-null  object 
 3   q3         28107 non-null  object 
 4   q4         7273 non-null   object 
 5   q5         28108 non-null  object 
 6   q6         20793 non-null  float64
 7   q7         28108 non-null  object 
 8   q8         211 non-null    object 
 9   q9         3047 non-null   object 
 10  q10        28108 non-null  object 
 11  q11        23074 non-null  object 
 12  q12        28026 non-null  object 
 13  q13        28108 non-null  object 
 14  q14        28108 non-null  object 
 15  q15        27885 non-null  object 
 16  q16        27937 non-null  object 
 17  q17        27931 non-null  object 
dtypes: float64(1), object(17)
memory usage: 3.9+ MB


In [4]:
df.head()

Unnamed: 0,timestamp,q1,q2,q3,q4,q5,q6,q7,q8,q9,q10,q11,q12,q13,q14,q15,q16,q17
0,4/27/2021 11:02:10,25-34,Education (Higher Education),Research and Instruction Librarian,,55000,0.0,USD,,,United States,Massachusetts,Boston,5-7 years,5-7 years,Master's degree,Woman,White
1,4/27/2021 11:02:22,25-34,Computing or Tech,Change & Internal Communications Manager,,54600,4000.0,GBP,,,United Kingdom,,Cambridge,8 - 10 years,5-7 years,College degree,Non-binary,White
2,4/27/2021 11:02:38,25-34,"Accounting, Banking & Finance",Marketing Specialist,,34000,,USD,,,US,Tennessee,Chattanooga,2 - 4 years,2 - 4 years,College degree,Woman,White
3,4/27/2021 11:02:41,25-34,Nonprofits,Program Manager,,62000,3000.0,USD,,,USA,Wisconsin,Milwaukee,8 - 10 years,5-7 years,College degree,Woman,White
4,4/27/2021 11:02:42,25-34,"Accounting, Banking & Finance",Accounting Manager,,60000,7000.0,USD,,,US,South Carolina,Greenville,8 - 10 years,5-7 years,College degree,Woman,White


### Assignment
Your goal for this assignment is to observe how your data treatment during the cleaning process can skew or bias the dataset.

Before diving right in, stop and read through the questions associated with the dataset. As you can see, they are either free-form text entries or categorical selections. Knowing this, perform some exploratory data analysis (EDA) to investigate the "state" of the dataset.

[Add as many code cell below here as needs]


**Question:** How would you describe the "state" of this dataset? Be specific and detailed in your answer. (Think paragraphs rather than sentences).



### State of the Dataset: A Detailed Analysis
The dataset appears to be a large-scale survey dataset containing 28,108 entries across 18 columns. It primarily consists of categorical and textual data, with some numerical fields related to salaries and other metrics. While the dataset provides rich information, it also presents several challenges related to missing values, formatting inconsistencies, and data type mismatches. A thorough cleaning and preprocessing process would be necessary before using it for analysis.  

One of the most significant issues is the presence of missing values across multiple columns. Certain fields, such as `q4` and `q8`, have extremely high levels of missing data, making them candidates for removal or imputation. Other columns, like `q6` and `q9`, also have missing values, which might affect downstream analysis if not handled properly. Depending on the importance of these fields, various strategies such as replacing missing values with defaults, using mean/median imputation, or dropping incomplete rows may be necessary.  

Another key issue is the inconsistency in data formatting. For instance, salary-related fields (`q5`) are stored as strings with commas, requiring conversion to numerical data types for proper calculations. Similarly, country and state fields (`q10` and `q11`) contain inconsistencies, such as abbreviations ("US," "USA," "United States") and variations in state names, which may require standardization. Additionally, experience-related fields (`q13` and `q14`) use different formats (e.g., "5-7 years" vs. "8 - 10 years"), which may complicate categorization and comparative analysis.  

The dataset also includes categorical variables that exhibit potential inconsistencies in capitalization, spacing, and terminology. Variables such as gender (`q16`) and ethnicity (`q17`) should be checked for uniformity to prevent redundancy in category counts (e.g., "woman" vs. "Woman" or "white" vs. "White"). Standardizing all text-based responses by converting them to lowercase and stripping unnecessary spaces will improve consistency and reduce errors in analysis.  

Lastly, there are data type mismatches that need attention. While most fields are stored as objects (strings), certain columns, such as `q6`, are numerical but have missing values represented as `NaN`, which could affect calculations. Converting all relevant numerical fields into appropriate data types ensures proper statistical analysis, aggregations, and visualizations. Additionally, timestamp data (`timestamp`) should be checked to ensure correct parsing if needed for time-based trends.  

Overall, while the dataset contains valuable insights, it requires comprehensive data cleaning before meaningful analysis can be conducted. Addressing missing values, standardizing text formats, correcting data types, and ensuring consistency across categorical variables will significantly enhance its usability for further statistical modeling, visualization, or predictive analytics.

#### The Plan

Now, it is time to plan how you will clean up the dataset. You **are not** allowed to use any machine learning technique to clean the data. (No SMOTE! No machine learning! Or anything like that!)

**Question:** Based on your EDA above, detail how you would clean up this dataset. 
Things to consider: (This is not an exhaustive list)
- Are there columns that can't be effectively cleaned? If so, why?
- Are there columns that genuinely won't have a data value?
- Does it make sense to segment the dataset based on specific columns when determining how to handle the missing values?
- Are outliers a factor in this dataset?

Remember preserving as much of the data as possible is the goal. That means dropping rows with a missing value somewhere might not be the best idea.

[Add you answer to this markdown cell]

**Answer:**
### **Data Cleaning Strategy for the Survey Dataset**  

#### **1. Handling Columns That Can’t Be Effectively Cleaned**  
Certain columns in this dataset have an overwhelming amount of missing data, making them difficult to clean effectively. For example, `q4` and `q8` have an extremely high percentage of missing values (over 70-80%). If these fields are optional survey responses, their absence may be expected rather than problematic. Instead of attempting to impute these values, it may be more practical to drop these columns entirely unless domain-specific reasoning suggests their importance.  

#### **2. Identifying Columns That Genuinely Won’t Have Data**  
Some missing values may not necessarily indicate poor data quality but rather be a result of the survey structure. For example:  
- If `q6` (a numeric field) represents a monetary value such as a bonus or additional compensation, then missing values might indicate that the respondent did not receive any. In this case, filling missing values with zero (`0`) rather than removing them could be more appropriate.  
- `q9` appears to have a large portion of missing data, possibly due to being a follow-up response that only certain participants were required to answer. If this is a conditional field, then leaving missing values as "not_applicable" might be more logical rather than imputing them arbitrarily.  

#### **3. Handling Missing Values Based on Column Relevance**  
It makes sense to segment the dataset based on specific columns when addressing missing values. A blanket approach might not be ideal because different types of data require different handling methods.  
- **Categorical Columns (`q1`, `q2`, `q3`, etc.)**: Missing values can be filled with `"unknown"` or `"not_applicable"` to maintain completeness without distorting patterns.  
- **Numerical Columns (`q5`, `q6`)**: Missing salary values in `q5` (e.g., "55,000" stored as a string) need conversion to numeric types, and missing values should be examined. If salary data follows a normal distribution, mean or median imputation could be viable. If highly skewed, imputation should be based on percentile capping.  
- **Location-Based Columns (`q10`, `q11`)**: Standardization is required for consistency (e.g., "USA" vs. "United States"). Missing state values should be left as `"unknown"` unless a reliable inference can be made based on `q10`.  

#### **4. Addressing Outliers**  
Since `q5` (salary) and `q6` (bonus/compensation) contain numerical data, outliers should be assessed.  
- Salaries typically follow a right-skewed distribution, meaning there may be a few extreme high earners. Instead of removing them outright, applying a 99th percentile cap ensures that unrealistic values do not distort the analysis.  
- If `q6` contains bonuses, large values should be investigated to ensure they are not data entry errors (e.g., misplaced decimal points).  

#### **5. Standardizing Text Data**  
- **Column Names:** Convert all column names to lowercase and replace spaces with underscores for consistency.  
- **String Formatting:** Convert all categorical variables to lowercase and strip unnecessary spaces.  
- **Experience Fields (`q13`, `q14`)**: Standardize to a single format, such as `"2-4 years"`, to ensure consistency when analyzing experience levels.  
- **Gender (`q16`) and Ethnicity (`q17`)**: Ensure uniform formatting (e.g., "woman" vs. "Woman" should be standardized).  

### **Final Cleanup Plan**  
1. **Drop high-missing-value columns** (`q4`, `q8`) unless analysis requires them.  
2. **Convert salary and bonus fields** (`q5`, `q6`) to numeric values after handling missing data.  
3. **Standardize categorical fields** (`q1`-`q3`, `q10`-`q17`) by cleaning text formats.  
4. **Address missing values contextually**, either by imputation, setting to `"unknown"`, or leaving as `"not_applicable"`.  
5. **Identify and handle outliers** by capping numerical values at the 99th percentile.  
6. **Ensure uniformity in experience and location-based fields** for better segmentation and visualization.  

By implementing these steps, the dataset will be significantly cleaner, reducing potential errors and making it ready for meaningful analysis.

#### Implementation

In [6]:
# 1. Standardizing Column Names (removes spaces, lowercase, replaces spaces with underscores)
print("Original Column Names:", df.columns)
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
print("Updated Column Names:", df.columns)  # Check updated names

# 2. Handling Missing Values
# Identify high-missing-value columns
missing_threshold = 0.7  # Drop columns with more than 70% missing data
cols_to_drop = [col for col in df.columns if df[col].isna().mean() > missing_threshold]
df.drop(columns=cols_to_drop, inplace=True)

# Fill missing values contextually
fill_values = {
    'q6': 0,  # Assuming q6 is a numeric column where missing values imply zero
    'q9': 'not_applicable',  # Assuming q9 is a categorical variable
}
df.fillna(fill_values, inplace=True)

# 3. Standardizing Categorical Variables
categorical_columns = [col for col in df.columns if df[col].dtype == 'object']
for col in categorical_columns:
    df[col] = df[col].astype(str).str.strip().str.lower()

# 4. Outlier Treatment (Removing extreme values in Salary and Bonus columns if applicable)
numeric_columns = ['q5', 'q6']  # Assuming q5 (salary) and q6 (bonus) contain numerical data
for col in numeric_columns:
    if col in df.columns and df[col].dtype != 'object':
        df = df[df[col] <= df[col].quantile(0.99)]  # Cap at 99th percentile

# 5. Data Type Corrections
if 'q5' in df.columns:
    df['q5'] = df['q5'].str.replace(',', '').astype(float)  # Convert salary to numeric

# 6. Standardizing Experience and Location Fields
if 'q10' in df.columns:
    df['q10'] = df['q10'].replace({'usa': 'united_states', 'us': 'united_states'})  # Normalize country names

if 'q13' in df.columns:
    df['q13'] = df['q13'].str.replace(r'\s*-\s*', '-')  # Ensure uniform experience formatting

# Display cleaned dataset info
df.info()
df.head()

Original Column Names: Index(['timestamp', 'q1', 'q2', 'q3', 'q4', 'q5', 'q6', 'q7', 'q8', 'q9',
       'q10', 'q11', 'q12', 'q13', 'q14', 'q15', 'q16', 'q17'],
      dtype='object')
Updated Column Names: Index(['timestamp', 'q1', 'q2', 'q3', 'q4', 'q5', 'q6', 'q7', 'q8', 'q9',
       'q10', 'q11', 'q12', 'q13', 'q14', 'q15', 'q16', 'q17'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
Index: 27826 entries, 0 to 28107
Data columns (total 18 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   timestamp  27826 non-null  object 
 1   q1         27826 non-null  object 
 2   q2         27826 non-null  object 
 3   q3         27826 non-null  object 
 4   q4         27826 non-null  object 
 5   q5         27826 non-null  float64
 6   q6         27826 non-null  float64
 7   q7         27826 non-null  object 
 8   q8         27826 non-null  object 
 9   q9         27826 non-null  object 
 10  q10        27826 non-null  object 
 11  q11   

Unnamed: 0,timestamp,q1,q2,q3,q4,q5,q6,q7,q8,q9,q10,q11,q12,q13,q14,q15,q16,q17
0,4/27/2021 11:02:10,25-34,education (higher education),research and instruction librarian,,55000.0,0.0,usd,,,united states,massachusetts,boston,5-7 years,5-7 years,master's degree,woman,white
1,4/27/2021 11:02:22,25-34,computing or tech,change & internal communications manager,,54600.0,4000.0,gbp,,,united kingdom,,cambridge,8 - 10 years,5-7 years,college degree,non-binary,white
2,4/27/2021 11:02:38,25-34,"accounting, banking & finance",marketing specialist,,34000.0,0.0,usd,,,united_states,tennessee,chattanooga,2 - 4 years,2 - 4 years,college degree,woman,white
3,4/27/2021 11:02:41,25-34,nonprofits,program manager,,62000.0,3000.0,usd,,,united_states,wisconsin,milwaukee,8 - 10 years,5-7 years,college degree,woman,white
4,4/27/2021 11:02:42,25-34,"accounting, banking & finance",accounting manager,,60000.0,7000.0,usd,,,united_states,south carolina,greenville,8 - 10 years,5-7 years,college degree,woman,white


Based on the plan the you described above, go ahead and clean up the dataset.

[Add as many code cell below here as needs]

#### Reflection
Write a short reflection (400-500 words) answering the following: 
- What were the biggest issues you encountered in the messy dataset?
- How did cleaning the dataset improve its usability for machine learning?
- What would happen if we trained a model on the messy dataset vs. the cleaned one?
- Do you feel you skewed or biased the dataset while cleaning it?

**Answer:**

**Reflection on Dataset Cleaning**

Cleaning the survey dataset was a crucial step in ensuring its usability for machine learning and statistical analysis. The raw dataset contained several issues that could have negatively impacted model performance if left unaddressed.

One of the biggest issues in the dataset was the presence of missing values. Some columns had over 70% missing data, making them unreliable for analysis. In such cases, retaining these columns would have introduced noise rather than meaningful insights. Additionally, missing values in numerical fields like salary and bonus could distort statistical computations, while missing categorical values could lead to inconsistencies in classification tasks. Standardizing missing values by imputing or labeling them as “not_applicable” improved data integrity.

Another major challenge was inconsistent text formatting in categorical variables. Variations in capitalization, extra spaces, and different representations of the same value (e.g., "USA" vs. "United States") could have led to redundant categories during encoding. By converting text to lowercase and normalizing values, we ensured consistency, reducing the risk of misclassification in categorical variables.

Outliers in numerical columns, particularly in salary and bonus, also posed a challenge. Extreme values could disproportionately influence machine learning models, leading to skewed predictions. By capping values at the 99th percentile, we mitigated the impact of extreme outliers while preserving meaningful data. Additionally, some numerical columns were stored as text, which would have caused errors in computations. Converting these fields to appropriate numerical formats improved the dataset’s usability for statistical operations.

Cleaning the dataset significantly improved its usability for machine learning. By standardizing formats, handling missing values, and addressing outliers, we created a more structured dataset. Machine learning algorithms perform better when they receive consistent, well-organized data. A cleaned dataset allows for more accurate feature engineering, model training, and evaluation. Without these preprocessing steps, the model might misinterpret noise as meaningful patterns, leading to poor generalization.

If we trained a machine learning model on the messy dataset, we would likely see several issues. Missing values could cause errors in models that don’t handle them well, and inconsistencies in categorical variables could lead to improper encoding. Furthermore, extreme outliers could skew results, causing models to overfit to rare, unrepresentative data points. A cleaned dataset ensures that the model learns from relevant patterns rather than being influenced by inconsistencies.

While cleaning the dataset, there was a potential risk of skewing or biasing the data. For example, imputing missing values or removing outliers inherently alters the dataset. If not done carefully, this process could introduce bias by overrepresenting certain patterns while underrepresenting others. To mitigate this, we made context-driven decisions, such as replacing missing categorical values with “not_applicable” rather than arbitrary placeholders. Ensuring that transformations preserved the dataset’s integrity was a priority.

Overall, the cleaning process made the dataset more reliable for machine learning and analysis, ensuring that models trained on it would yield meaningful insights.



## Deliverables
Upload your Jupyter Notebook to your GitHub repo and then provide a link to that repo in Worlclass. 