> ### Note on Labs and Assignments:
>
> 🔧 Look for the **wrench emoji** 🔧. It highlights where you're expected to take action!
>
> These sections are graded and are not optional.
>

# IS 4487 Lab 6: Data Cleaning

## Outline

- Load and inspect a new dataset (Megatelco)
- Fix column names and data types
- Handle missing values
- Remove duplicate rows
- Review and remove outliers
- Reflect on data quality

In this lab, we'll clean the data to get it ready for transformations and analysis.

We will continue working with this dataset in **Lab 7**, where we will create new features and apply transformations.

<a href="https://colab.research.google.com/github/Stan-Pugsley/is_4487_base/blob/main/Labs/lab_06_data_cleaning.ipynb" target="_parent">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Megatelco Data Dictionary

 DEMOGRAPHIC VARIABLES:
 - College - has the customer attended some college (one, zero)
 - Income - annual income of customer
 - House - estimated price of the customer's home (if applicable)

 USAGE VARIABLES:
 - Data Overage Mb - Average number of megabytes that the customer used in excess of the plan limit (over last 12 months)
 - Data Leftover Mb - Average number of megabytes by which the customer's usage fell below the plan limit (over last 12 months)
 - Data Mb Used - Average number of megabytes used per month (over last 12 months)
 - Text Message Count - Average number of texts per month (over last 12 months)
 - Over 15 Minute Calls Per Month - Average number of calls over 15 minutes in duration per month (over last 12 months)
 - Average Call Duration - Average call duration (over last 12 months)

PHONE VARIABLES:
 - Operating System - Current operating system of phone
 - Handset Price - Retail price of the phone used by the customer

ATTITUDINAL VARIABLES:
 - Reported Satisfaction - Survey response to "How satisfied are you with your current phone plan?" (high, avg, low)
 - Reported Usage Level - Survey response to "How much do you use your phone?" (high, avg, low)
 - Considering Change of Plan - Survey response to "Are you currently planning to change companies when your contract expires?" (yes, no)

OTHER VARIABLES:
 - Leave - Did this customer churn with the last contract expiration? (LEAVE, STAY)
 - ID - Customer identifier

In [1]:
import pandas as pd

url = "https://raw.githubusercontent.com/Stan-Pugsley/is_4487_base/812e9f15c357a5657a2795631fcaa9d9363cb417/DataSets/megatelco_leave_survey_data_cleaning_v2.csv"
df = pd.read_csv(url)

df.head()

Unnamed: 0,college,income,data_overage_mb,data_leftover_mb,data_mb_used,text_message_count,house,handset_price,over_15mins_calls_per_month,average_call_duration,reported_satisfaction,reported_usage_level,considering_change_of_plan,leave,id,operating_system
0,one,403137.0,70,0.0,6605.0,199,841317,653.0,5.0,8.0,low,low,yes,LEAVE,8183,Android
1,zero,129700.0,67,16.0,6028.0,134,476664,1193.0,5.0,5.0,low,low,yes,LEAVE,12501,IOS
2,zero,69741.0,60,0.0,1482.0,176,810225,1037.0,3.0,8.0,low,low,yes,STAY,7425,IOS
3,one,377572.0,0,22.0,3005.0,184,826967,1161.0,0.0,5.0,low,low,no,LEAVE,13488,IOS
4,zero,382080.0,0,0.0,1794.0,74,951896,1023.0,0.0,14.0,low,low,yes,STAY,11389,IOS


In [3]:
# create a copy of your dataset for use in part 4
copied_df = df.copy(deep=True)

## Part 1: Review Column Names and Structure

Think about:

- Are column names consistent (lowercase, no spaces)?
- Are there any typos or redundant labels?
- Do the rows and columns appear aligned? (Are all the columns the same size? Are all the rows the same size?)

Why this matters:
Inconsistent or messy column names can break code and make analysis harder to follow.
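
The questions above can be checked in code before renaming anything. The sketch below is illustrative only: the header names in `raw` are made up, not taken from the Megatelco file. It applies the same trim, lowercase, and underscore normalization used in this lab, then reports which names changed and which collide afterwards.

```python
import pandas as pd

# Hypothetical messy headers, used only for illustration (not from the Megatelco file)
raw = pd.DataFrame(columns=[" Income", "Handset Price", "income", "ID "])

# Same normalization the lab applies: trim, lowercase, replace spaces with underscores
cleaned = raw.columns.str.strip().str.lower().str.replace(" ", "_")

# Names that change under normalization were inconsistent to begin with
changed = [old for old, new in zip(raw.columns, cleaned) if old != new]

# Names that collide after normalization point to redundant columns
collisions = cleaned[cleaned.duplicated()].tolist()

print("Changed:", changed)
print("Collisions:", collisions)
```

A non-empty `collisions` list is worth resolving before renaming, because two columns with the same name make later selections such as `df['income']` ambiguous.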




In [4]:
# Standardize column names: lowercase, no spaces
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# Get column info and data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15016 entries, 0 to 15015
Data columns (total 16 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   college                      15016 non-null  object 
 1   income                       15006 non-null  float64
 2   data_overage_mb              15016 non-null  int64  
 3   data_leftover_mb             14916 non-null  float64
 4   data_mb_used                 14916 non-null  float64
 5   text_message_count           15016 non-null  int64  
 6   house                        15016 non-null  int64  
 7   handset_price                14916 non-null  float64
 8   over_15mins_calls_per_month  15013 non-null  float64
 9   average_call_duration        14916 non-null  float64
 10  reported_satisfaction        15016 non-null  object 
 11  reported_usage_level         15016 non-null  object 
 12  considering_change_of_plan   14201 non-null  object 
 13  leave           

In [5]:
# View descriptive statistics for numerical columns
df.describe()

Unnamed: 0,income,data_overage_mb,data_leftover_mb,data_mb_used,text_message_count,house,handset_price,over_15mins_calls_per_month,average_call_duration,id
count,15006.0,15016.0,14916.0,14916.0,15016.0,15016.0,14916.0,15013.0,14916.0,15016.0
mean,242013.863455,153.430674,37.487664,4200.979686,135.94659,877129.3,794.937249,10.56551,10.060941,11856.541289
std,109627.859666,113.019892,28.052318,2203.802446,62.934783,287016.8,1238.997927,8.40421,41.188957,6812.183367
min,-65000.0,0.0,0.0,400.0,52.0,-463.0,-200.0,0.0,1.0,2.0
25%,147818.5,54.0,12.0,2292.75,93.0,644467.8,498.0,3.0,5.0,6135.0
50%,241750.5,151.0,34.0,4220.0,135.0,876253.0,777.0,9.0,10.0,11754.5
75%,336442.0,242.0,62.0,6079.25,178.0,1098829.0,1063.0,17.0,14.0,17390.5
max,432000.0,380.0,89.0,8000.0,5000.0,1456389.0,125000.0,35.0,5000.0,25354.0


### Inspect categorical variables
Note that `df.describe()` summarizes only numeric and datetime columns by default. Columns stored as `object` hold strings: some are categorical (a limited, fixed set of allowed values), while others are free text (any value is possible).

For `object` columns that we suspect are categorical, you will often want to know which values they contain. We can check this with `df[colname].value_counts()`.

In [6]:
display(df['college'].value_counts())
display(df['reported_satisfaction'].value_counts())
display(df['reported_usage_level'].value_counts())
display(df['considering_change_of_plan'].value_counts())
display(df['operating_system'].value_counts())
display(df['leave'].value_counts())

Unnamed: 0_level_0,count
college,Unnamed: 1_level_1
zero,7960
one,7056


Unnamed: 0_level_0,count
reported_satisfaction,Unnamed: 1_level_1
low,10850
high,3415
avg,751


Unnamed: 0_level_0,count
reported_usage_level,Unnamed: 1_level_1
low,12235
high,2536
avg,245


Unnamed: 0_level_0,count
considering_change_of_plan,Unnamed: 1_level_1
yes,9267
no,4934


Unnamed: 0_level_0,count
operating_system,Unnamed: 1_level_1
Android,7813
IOS,7203


Unnamed: 0_level_0,count
leave,Unnamed: 1_level_1
STAY,7532
LEAVE,7484
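
One detail worth knowing: `value_counts()` leaves out missing values by default, so the counts for a column with gaps (such as `considering_change_of_plan`, where 9267 + 4934 is less than the 15016 rows) do not add up to the row count. A minimal sketch on a made-up toy column shows how `dropna=False` makes the gap visible:

```python
import numpy as np
import pandas as pd

# Toy column, invented for illustration
s = pd.Series(["yes", "no", np.nan, "yes", np.nan], name="considering_change_of_plan")

default_counts = s.value_counts()            # NaN is excluded
full_counts = s.value_counts(dropna=False)   # NaN gets its own row

print(default_counts)
print(full_counts)
```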


## Part 2: Convert Data Types

Before analysis, make sure each column is stored in the correct format. This helps avoid calculation errors, makes plotting smoother, and ensures models interpret the data correctly.

Think about:
- Are numbers accidentally stored as strings?
- Should repeated text values be converted to categories?
- Are "yes"/"no" columns better represented as binary (0/1) or categorical types?

Fixing data types now saves time and avoids issues later in your workflow.
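
As a rough illustration of the first and third questions, the sketch below uses two small made-up Series (the values are assumptions, not rows from the lab data). `pd.to_numeric` with `errors="coerce"` converts text to numbers and turns anything unparseable into `NaN`, and `.map()` turns a two-valued text column into 0/1.

```python
import pandas as pd

# Hypothetical column where numbers arrived as text
prices = pd.Series(["653", "1193", "n/a", " 1037 "])

# errors="coerce" turns unparseable entries into NaN instead of raising an error
numeric = pd.to_numeric(prices.str.strip(), errors="coerce")

# Map a two-valued text column to 0/1
answers = pd.Series(["yes", "no", "yes"])
binary = answers.map({"yes": 1, "no": 0})

print(numeric)
print(binary)
```

Counting `numeric.isna()` after the conversion shows how many entries could not be parsed, which is a useful check before deciding how to handle them.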




In [8]:
# Check original data types
print("Original dtypes:\n", df.dtypes)

# Convert categorical text columns
df['college'] = df['college'].astype('category')
df['reported_satisfaction'] = df['reported_satisfaction'].astype('category')
df['operating_system'] = df['operating_system'].astype('category')

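# Convert text columns that have a natural order to ordered categorical columns.
# Note: the labels in `categories` must match the data exactly. The data uses
# 'low', 'avg', 'high', so the capitalized labels below match nothing and every
# entry becomes NaN (visible as "0 non-null" in later output).
df['reported_satisfaction'] = pd.Categorical(df['reported_satisfaction'], categories = ['Low', 'Medium', 'High'], ordered = True)
df['reported_usage_level'] = pd.Categorical(df['reported_usage_level'], categories = ['Low', 'Medium', 'High'], ordered = True)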

# Convert binary columns ('yes'/'no', 'LEAVE'/'STAY') to binary categorical
df['considering_change_of_plan'] = df['considering_change_of_plan'].astype('category')
df['leave'] = df['leave'].astype('category')

# Check updated data types
print("\nUpdated dtypes:\n", df.dtypes)


Original dtypes:
 college                        category
income                          float64
data_overage_mb                   int64
data_leftover_mb                float64
data_mb_used                    float64
text_message_count                int64
house                             int64
handset_price                   float64
over_15mins_calls_per_month     float64
average_call_duration           float64
reported_satisfaction          category
reported_usage_level           category
considering_change_of_plan     category
leave                          category
id                                int64
operating_system               category
dtype: object

Updated dtypes:
 college                        category
income                          float64
data_overage_mb                   int64
data_leftover_mb                float64
data_mb_used                    float64
text_message_count                int64
house                             int64
handset_price                 
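
A caution about `pd.Categorical`: any value that is not listed in `categories` is silently replaced with `NaN`. The `value_counts()` output earlier shows the labels `low`, `avg`, and `high`, so a category list of `['Low', 'Medium', 'High']` matches none of them. The sketch below reproduces the effect on a small made-up Series; it is an illustration, not the lab's official solution.

```python
import pandas as pd

s = pd.Series(["low", "avg", "high", "low"])

# Labels that do not appear in the data: every entry becomes NaN
mismatched = pd.Series(
    pd.Categorical(s, categories=["Low", "Medium", "High"], ordered=True)
)

# Labels copied from the value_counts() output: values are preserved
matched = pd.Series(
    pd.Categorical(s, categories=["low", "avg", "high"], ordered=True)
)

print("NaN count with mismatched labels:", mismatched.isna().sum())
print("NaN count with matching labels:", matched.isna().sum())

# Ordered categoricals support comparisons against a category label
print(matched > "low")
```

Comparing `isna().sum()` before and after a conversion is a quick way to catch this kind of mismatch.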

### 🔧 Try It Yourself – Part 2

1. Convert the `leave` column from "LEAVE"/"STAY" to binary (`1`/`0`) and make it a **category**
2. Convert `reported_usage_level` to a **categorical** type
3. Convert `house` to an **integer** type
4. Use `.info()` to confirm the changes


In [14]:
df['leave'] = df['leave'].map({'LEAVE': 1, 'STAY': 0}).astype('category')

# 'reported_usage_level' was already converted to an ordered categorical type in a previous step.
# Re-applying it here to ensure it's explicitly done as per the prompt.
df['reported_usage_level'] = pd.Categorical(df['reported_usage_level'], categories=['low', 'med', 'high'], ordered=True)

df['house'] = df['house'].astype(int)

# Confirm the changes
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 14989 entries, 0 to 15015
Data columns (total 16 columns):
 #   Column                       Non-Null Count  Dtype   
---  ------                       --------------  -----   
 0   college                      14989 non-null  category
 1   income                       14989 non-null  float64 
 2   data_overage_mb              14989 non-null  int64   
 3   data_leftover_mb             14989 non-null  float64 
 4   data_mb_used                 14989 non-null  float64 
 5   text_message_count           14989 non-null  int64   
 6   house                        14989 non-null  int64   
 7   handset_price                14989 non-null  float64 
 8   over_15mins_calls_per_month  14989 non-null  float64 
 9   average_call_duration        14989 non-null  float64 
 10  reported_satisfaction        0 non-null      category
 11  reported_usage_level         0 non-null      category
 12  considering_change_of_plan   14989 non-null  category
 13  leave 
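
Later output in this notebook shows `leave` with no non-null values. One possible cause (an assumption, since the notebook does not confirm it) is that the mapping cell was run more than once: after the first run the column holds `1`/`0`, and mapping those through `{'LEAVE': 1, 'STAY': 0}` a second time finds no matching keys and yields `NaN`. The sketch below shows the effect on a toy Series, along with a hypothetical helper, `to_binary`, that is safe to re-run.

```python
import pandas as pd

leave = pd.Series(["LEAVE", "STAY", "LEAVE"])
mapping = {"LEAVE": 1, "STAY": 0}

once = leave.map(mapping)
twice = once.map(mapping)  # 1 and 0 are not keys in the mapping, so every value becomes NaN

def to_binary(col: pd.Series) -> pd.Series:
    """Hypothetical helper: apply the mapping only while the text labels are still present."""
    if col.isin(list(mapping)).all():
        return col.map(mapping)
    return col

print(twice)
print(to_binary(to_binary(leave)))
```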

## Part 3: Handle Missing Values

Missing data can break charts, skew stats, and disrupt models, so it needs to be handled carefully.

### Think about:
- Are the missing values random or patterned?
- Can we drop rows, or do we need to fill them?
- Should we use mean, median, or something else?

### Guidelines:
- Drop rows if only a few values are missing and the affected columns are essential to keep intact
- Use the median to fill missing values in numeric columns, since it is less sensitive to outliers than the mean
- Use 0 if the missing value means "none" (e.g., if the question was "Do you have a history of chronic illness?" and the answer was left blank, we can assume the blank means the patient has no such history)
- Use the mode to fill missing values in categorical columns

Cleaning missing values early avoids bigger problems later.
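
To make the guidelines concrete, here is a minimal sketch on a made-up DataFrame (the `plan` column and all values are assumptions for illustration). It first summarizes how much is missing per column, then applies the median, zero, and mode strategies.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "income": [50000.0, np.nan, 72000.0, 61000.0],
    "data_leftover_mb": [12.0, np.nan, np.nan, 30.0],
    "plan": ["a", "b", None, "b"],  # hypothetical categorical column
})

# How much is missing, as a count and as a percentage of rows
summary = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": (toy.isna().mean() * 100).round(1),
})
print(summary)

filled = toy.copy()
filled["income"] = filled["income"].fillna(filled["income"].median())   # numeric: median
filled["data_leftover_mb"] = filled["data_leftover_mb"].fillna(0)       # blank means "none"
filled["plan"] = filled["plan"].fillna(filled["plan"].mode()[0])        # categorical: mode
print(filled)
```

The percentage column helps with the drop-or-fill decision: dropping rows is low-risk when only a small share is affected, while a large share usually calls for filling or a closer look at why the values are missing.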

-----


**Note on `.loc` and Warnings** - When assigning values to a DataFrame, especially after filtering or copying, it's best to use `.loc` to avoid **`SettingWithCopyWarning`**. This ensures that you're updating the original data and not a temporary view of it.
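
A small sketch of that pattern, using a made-up DataFrame named `df_demo`: take an explicit `.copy()` after filtering, then assign through `.loc` with a row mask and a column label. Whether a warning appears without these steps depends on the pandas version, so treat this as a defensive habit, not a guarantee about any specific release.

```python
import pandas as pd

df_demo = pd.DataFrame({"income": [100.0, None, 300.0], "id": [1, 2, 3]})

# Filtering returns a new object; .copy() makes its independence explicit
subset = df_demo[df_demo["id"] > 1].copy()

# Assign through .loc with a row mask and a column label
subset.loc[subset["income"].isna(), "income"] = 0.0

print(subset)
print(df_demo)  # the original frame still contains its missing value
```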


In [10]:
# View missing value counts
print("Missing values per column:\n", df.isnull().sum())

# Fill 'handset_price' with median
df['handset_price'] = df['handset_price'].fillna(df['handset_price'].median())

# Drop rows with missing 'income' (if very few)
df = df.dropna(subset=['income']).copy()

# Fill missing 'data_leftover_mb' with 0 if it logically means no leftover data
df.loc[:, 'data_leftover_mb'] = df['data_leftover_mb'].fillna(0)

# Fill 'average_call_duration' with median if necessary
df.loc[:, 'average_call_duration'] = df['average_call_duration'].fillna(df['average_call_duration'].median())

# Fill 'data_mb_used' with median
df.loc[:, 'data_mb_used'] = df['data_mb_used'].fillna(df['data_mb_used'].median())

# Confirm updated missing values
print("\nMissing values after handling:\n", df.isnull().sum())


Missing values per column:
 college                            0
income                            10
data_overage_mb                    0
data_leftover_mb                 100
data_mb_used                     100
text_message_count                 0
house                              0
handset_price                    100
over_15mins_calls_per_month        3
average_call_duration            100
reported_satisfaction          15016
reported_usage_level           15016
considering_change_of_plan       815
leave                              0
id                                 0
operating_system                   0
dtype: int64

Missing values after handling:
 college                            0
income                             0
data_overage_mb                    0
data_leftover_mb                   0
data_mb_used                       0
text_message_count                 0
house                              0
handset_price                      0
over_15mins_calls_per_month        3
a

### 🔧 Try It Yourself – Part 3


There are still some missing values in:

- `over_15mins_calls_per_month`
- `considering_change_of_plan`

Decide how to handle them based on what makes the most sense:

- Should you fill them with 0, the median, or something else?
- For categories, would a placeholder like "unknown" or the most common value work?
- Or is it better to drop those rows?

1. Write and execute code to handle the missing values in the remaining two columns.
2. Use `df.isnull().sum()` to confirm all missing values are handled.



In [15]:
# Fill 'over_15mins_calls_per_month' with its median
df.loc[:, 'over_15mins_calls_per_month'] = df['over_15mins_calls_per_month'].fillna(df['over_15mins_calls_per_month'].median())

# Fill 'considering_change_of_plan' with its mode
# mode() returns a Series (ties produce several values), so take the first entry with [0]
mode_considering_change_of_plan = df['considering_change_of_plan'].mode()[0]
df.loc[:, 'considering_change_of_plan'] = df['considering_change_of_plan'].fillna(mode_considering_change_of_plan)

# Confirm all missing values are handled
print("Missing values after handling:", df.isnull().sum())

Missing values after handling: college                            0
income                             0
data_overage_mb                    0
data_leftover_mb                   0
data_mb_used                       0
text_message_count                 0
house                              0
handset_price                      0
over_15mins_calls_per_month        0
average_call_duration              0
reported_satisfaction          14989
reported_usage_level           14989
considering_change_of_plan         0
leave                          14989
id                                 0
operating_system                   0
dtype: int64


## Part 4: Remove Duplicate Rows

Sometimes the same row appears more than once due to data entry or processing mistakes. It's important to check for and remove these duplicates.

Think about:
- Are there rows that are exactly the same?
- If duplicates exist, should you keep the first one, the last one, or none?

Why this matters:
Duplicate rows can inflate totals, distort statistics, and lead to inaccurate conclusions.
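
The `keep` options are easiest to see on a tiny example. In the made-up frame below, the row `(2, 20)` appears three times. `duplicated()` counts only the extra copies, while `duplicated(keep=False)` marks every member of a duplicate group.

```python
import pandas as pd

toy = pd.DataFrame({"id": [1, 2, 2, 3, 2], "income": [10, 20, 20, 30, 20]})

extra_copies = toy.duplicated().sum()            # copies beyond the first occurrence
all_members = toy.duplicated(keep=False).sum()   # every row that belongs to a duplicate group

first = toy.drop_duplicates(keep="first")  # keeps the earliest copy
last = toy.drop_duplicates(keep="last")    # keeps the latest copy
none = toy.drop_duplicates(keep=False)     # drops every copy

print(extra_copies, all_members)
print(first.index.tolist(), last.index.tolist(), none.index.tolist())
```

`keep="first"` and `keep="last"` always leave the same number of rows; they differ only in which index labels survive.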


In [12]:
# Check for exact duplicates
print(f"Number of duplicate rows: {df.duplicated().sum()}")

# Remove them, keeping the first occurrence
df = df.drop_duplicates()

# Confirm result
print(f"Remaining rows after removing duplicates: {len(df)}")

Number of duplicate rows: 17
Remaining rows after removing duplicates: 14989


### 🔧 Try It Yourself – Part 4

1. Use `copied_df.duplicated().sum()` to count how many duplicates are in your dataset.
2. Try using `copied_df.drop_duplicates(keep='last')` instead. What changes?

### In Your Response:
1. Explore whether duplicate rows share the same ID or just values across all columns and comment on your observation.


In [16]:
# 1. Count how many duplicates are in copied_df
initial_duplicate_count = copied_df.duplicated().sum()
print(f"Number of duplicate rows in copied_df: {initial_duplicate_count}")
print(f"Original number of rows in copied_df: {len(copied_df)}")

# 2. Try using copied_df.drop_duplicates(keep='last') instead
df_keep_last = copied_df.drop_duplicates(keep='last')
print(f"Number of rows in copied_df after dropping duplicates (keeping the last occurrence): {len(df_keep_last)}")

# 3. Explore whether duplicate rows share the same ID or just values across all columns
# Identify all duplicated rows (keep=False marks all instances of a duplicate set as True)
all_duplicates_mask = copied_df.duplicated(keep=False)
true_duplicate_rows = copied_df[all_duplicates_mask].sort_values(by=list(copied_df.columns))

print("\nDetails of all duplicated rows (first few instances):")
# Display only a few to avoid excessive output, focusing on ID and a couple other columns
display(true_duplicate_rows[['id', 'college', 'income', 'data_overage_mb']].head(10))


Number of duplicate rows in copied_df: 17
Original number of rows in copied_df: 15016
Number of rows in copied_df after dropping duplicates (keeping the last occurrence): 14999

Details of all duplicated rows (first few instances):


Unnamed: 0,id,college,income,data_overage_mb
30,7010,one,103540.0,0
31,7010,one,103540.0,0
32,7010,one,103540.0,0
46,19786,one,284098.0,65
53,19786,one,284098.0,65
67,4743,one,291877.0,56
73,4743,one,291877.0,56
27,13678,one,324903.0,129
28,13678,one,324903.0,129
65,13354,one,325119.0,66


1. Upon exploring the duplicated rows, it's clear that the duplicates share the exact same values across all columns, including the id column. For example, id 7010 appears three times with identical college, income, and data_overage_mb values. This indicates that these are indeed full-row duplicates, not just rows with the same ID but different data.

Regarding copied_df.drop_duplicates(keep='last'):

* copied_df.duplicated().sum() reported 17 duplicate rows in the original copied_df.
* copied_df had 15016 rows initially.
* With the default keep='first', copied_df.drop_duplicates() would leave 14999 rows (15016 - 17), keeping the first occurrence in each set of duplicates. The earlier df ended at 14989 rows because the 10 rows with missing income had already been dropped before deduplication.
* Applying copied_df.drop_duplicates(keep='last') (creating df_keep_last) also left 14999 rows. The row count is the same either way; the difference is which instance survives: keep='first' retains the first occurrence, while keep='last' retains the last.
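
The question of "same ID or same values in every column" can also be answered directly by comparing two masks. The sketch below uses a made-up frame in which id 7010 is a true full-row duplicate and id 4743 is reused with a different income (that second case is invented for illustration and does not occur in the output above).

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [7010, 7010, 4743, 4743],
    "income": [103540.0, 103540.0, 291877.0, 250000.0],  # last value is hypothetical
})

full_row = toy.duplicated(keep=False)                # identical in every column
same_id = toy.duplicated(subset=["id"], keep=False)  # identical id only

# Rows that share an id but differ in some other column
conflicting = toy[same_id & ~full_row]
print(conflicting)
```

An empty `conflicting` frame supports the conclusion that the duplicates are exact copies; a non-empty one would mean the same ID carries conflicting data, which `drop_duplicates()` alone does not resolve.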

## Part 5: Identify and Remove Obvious Outliers

Outliers are values that fall far outside the normal range. They can come from data entry mistakes or rare cases.

- Use summary statistics or visual tools (like boxplots) to find them.
- Look for clearly unrealistic values, e.g., negative prices or extremely high data usage.
- Decide how to handle them:
  - Remove if they're errors.
  - Keep if they're valid but rare, or cap them if needed.

Outliers can distort averages, stretch visualizations, and mislead models, so it's important to address them carefully.
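
Boxplots conventionally flag points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule numerically to a made-up price column. The 1.5 multiplier is a common convention, not a rule given in this lab, and flagged values still need a judgment call about whether they are errors.

```python
import pandas as pd

# Made-up prices containing one implausibly high and one negative value
prices = pd.Series([498, 653, 777, 1023, 1063, 1193, 125000, -200])

q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are candidates for review, not automatic deletions
flagged = prices[(prices < lower) | (prices > upper)]
print(flagged)
```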



In [18]:
# Remove negative or nonsensical values using business rules

# Example: remove rows where 'handset_price' is negative
df = df[df['handset_price'] >= 0]

# Example: remove rows with unusually long call durations
df = df[df['average_call_duration'] < 1000]

# Example: remove rows with extremely high text message counts
# .copy() makes the filtered result an independent DataFrame for the later .loc assignments
df = df[df['text_message_count'] < 1000].copy()

# View shape after outlier filtering
print("Shape after removing obvious outliers:", df.shape)


Shape after removing obvious outliers: (14986, 16)


### 🔧 Try It Yourself – Part 5

1. Use `df.describe()` to look for columns with extreme minimum or maximum values.
2. Set a threshold for what you think is "too high" or "too low" for:
  - `data_mb_used`
  - `over_15mins_calls_per_month`
  - `income`
3. Remove those outliers using boolean filtering like `df = df[df['column'] < threshold]`

Descriptive statistics BEFORE custom boolean filtering:
              income  data_overage_mb  data_leftover_mb  data_mb_used  \
count   13723.000000     13723.000000      13723.000000  13723.000000   
mean   242076.367521       152.855061         37.022007   3940.804337   
std    109571.061156       113.084216         28.081091   2044.492390   
min     55833.412000         0.000000          0.000000    467.000000   
25%    147829.500000        54.000000         12.000000   2163.500000   
50%    241564.000000       149.000000         34.000000   3992.000000   
75%    336760.500000       241.000000         62.000000   5690.000000   
max    428035.475600       380.000000         89.000000   7429.216800   

       text_message_count         house  handset_price  \
count        13723.000000  1.372300e+04   13723.000000   
mean           135.615099  8.769273e+05     790.936676   
std             48.774440  2.869342e+05    1121.294800   
min             53.000000 -4.630000e+02     215.000000

## Part 6: Handle Outliers Using Quantiles

Instead of removing outliers, we can limit their impact by capping extreme values, a method known as **Winsorizing**.

### How to Do It:
- Use `.quantile()` to identify the 1st and 99th percentiles (or other thresholds).
- Use `.clip()` to cap values within that range.

This keeps your dataset intact while reducing the influence of extreme values on your analysis or model.
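
A toy illustration of the two steps, using made-up values (1 through 99 plus a single extreme value of 5000):

```python
import pandas as pd

# 1..99 plus one extreme value; floats so the capped bounds fit without type changes
values = pd.Series(list(range(1, 100)) + [5000], dtype="float64")

low, high = values.quantile([0.01, 0.99])
capped = values.clip(lower=low, upper=high)

print("rows before/after:", len(values), len(capped))
print("max before/after:", values.max(), capped.max())
```

The row count is unchanged, and the extreme value is pulled down to the 99th-percentile bound instead of being deleted.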



In [26]:
# Calculate 1st and 99th percentiles for income
income_min, income_max = df['income'].quantile([0.01, 0.99])

# Use .loc to avoid SettingWithCopyWarning and ensure assignment modifies the original DataFrame
df.loc[:, 'income'] = df['income'].clip(lower=income_min, upper=income_max)

# Clip 'data_mb_used' to within 1st and 99th percentiles
usage_min, usage_max = df['data_mb_used'].quantile([0.01, 0.99])
df.loc[:, 'data_mb_used'] = df['data_mb_used'].clip(lower=usage_min, upper=usage_max)

# Clip 'average_call_duration' to reduce the effect of extreme outliers
call_min, call_max = df['average_call_duration'].quantile([0.01, 0.99])
df.loc[:, 'average_call_duration'] = df['average_call_duration'].clip(lower=call_min, upper=call_max)



### 🔧 Try It Yourself – Part 6

1. Use `.quantile([0.01, 0.99])` to find the range for:
  - `text_message_count`
  - `over_15mins_calls_per_month`
2. Apply `.clip(lower=..., upper=...)` to reduce the impact of those outliers

### In Your Response:
1. Compare the `.describe()` output before and after clipping and comment on what you observe


Descriptive statistics BEFORE clipping:
              income  data_overage_mb  data_leftover_mb  data_mb_used  \
count   13723.000000     13723.000000      13723.000000  13723.000000   
mean   242076.513668       152.855061         37.022007   3940.799545   
std    109570.739485       113.084216         28.081091   2044.484214   
min     55850.089461         0.000000          0.000000    467.000000   
25%    147829.500000        54.000000         12.000000   2163.500000   
50%    241564.000000       149.000000         34.000000   3992.000000   
75%    336760.500000       241.000000         62.000000   5690.000000   
max    428033.331355       380.000000         89.000000   7428.740301   

       text_message_count         house  handset_price  \
count        13723.000000  1.372300e+04   13723.000000   
mean           135.615099  8.769273e+05     790.936676   
std             48.774440  2.869342e+05    1121.294800   
min             53.000000 -4.630000e+02     215.000000   
25%         

### ✍️ Your Response: 🔧
1. Before Clipping:

text_message_count: minimum was 53.0, maximum was 219.0
over_15mins_calls_per_month: minimum was 0.0, maximum was 29.0

2. After Clipping:

text_message_count: minimum remained 53.0, maximum remained 219.0
over_15mins_calls_per_month: minimum remained 0.0, maximum remained 29.0

3. Observation:

In this particular instance, the min and max values for both text_message_count and over_15mins_calls_per_month did not change after applying the clip function using the 1st and 99th percentiles. This suggests that there were no values in these columns that fell outside the calculated 1st and 99th percentile range at this point. It implies that the existing minimum and maximum values were already within or exactly at the boundaries defined by the 1st and 99th percentiles, or that any more extreme outliers were handled in a previous step.
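
One way to confirm an observation like this is to count how many values fall outside the percentile bounds before clipping. The sketch below uses made-up counts whose extremes (53 and 219) are each repeated several times, so the 1st and 99th percentiles land exactly on the minimum and maximum and clipping changes nothing.

```python
import pandas as pd

# Made-up counts: the smallest and largest values each occur five times
s = pd.Series([53] * 5 + list(range(60, 200)) + [219] * 5, dtype="float64")

lo, hi = s.quantile([0.01, 0.99])
n_outside = int(((s < lo) | (s > hi)).sum())

clipped = s.clip(lower=lo, upper=hi)
unchanged = bool((clipped == s).all())

print("bounds:", lo, hi)
print("values outside bounds:", n_outside)
print("clipping left the data unchanged:", unchanged)
```

When `n_outside` is 0, `.clip()` has nothing to cap, which matches the unchanged minimum and maximum reported above.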

## 🔧 Part 7: Reflection (100 words or less per question)

1. Which step fixed the most issues in the dataset?
2. What surprised you about the structure or values?
3. Do you feel this data is now ready for transformation in Lab 7?


### ✍️ Your Response: 🔧
1. **Which step fixed the most issues in the dataset?** The step involving handling missing values (Part 3) arguably fixed the most widespread issues, as many columns had null entries that could severely impact analysis. Additionally, standardizing column names (Part 1) and converting data types (Part 2) were foundational in preventing errors and ensuring data consistency.

2. **What surprised you about the structure or values?** I was surprised by the clearly erroneous negative values in income and house in the raw data, and by the extremely high outliers in columns like handset_price and average_call_duration. These point to significant data entry or collection problems beyond simple missingness and highlight the importance of thorough outlier detection. It was also a surprise that reported_satisfaction, reported_usage_level, and leave dropped to 0 non-null values after running cell TxDe-Td1ilji, which indicates an issue that needs further investigation.

3. **Do you feel this data is now ready for transformation in Lab 7?** Yes, after addressing inconsistent naming, correcting data types, handling missing values, removing duplicates, and managing obvious outliers through both removal and capping, the dataset is significantly cleaner and more reliable. It is now well-prepared for the feature engineering and transformations planned for Lab 7.

## Export Your Notebook to Submit in Canvas
- Use the instructions from Lab 1

In [29]:
!jupyter nbconvert --to html "lab_06_GuerreroDiego.ipynb"

[NbConvertApp] Converting notebook lab_06_GuerreroDiego.ipynb to html
[NbConvertApp] Writing 371580 bytes to lab_06_GuerreroDiego.html
