# **Data Wrangling Lab**


- Identify and remove inconsistent data entries.

- Encode categorical variables for analysis.

- Handle missing values using multiple imputation strategies.

- Apply feature scaling and transformation techniques.


#### Install the required libraries


In [1]:
!pip install pandas
!pip install matplotlib



## Tasks


#### Step 1: Import the necessary module.


### 1. Load the Dataset


<h5>1.1 Import necessary libraries and load the dataset.</h5>


In [2]:
# Import necessary libraries
import pandas as pd

# Load the Stack Overflow survey data
dataset_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv"
df = pd.read_csv(dataset_url)

# Display the first few rows
print(df.head())


   ResponseId                      MainBranch                 Age  \
0           1  I am a developer by profession  Under 18 years old   
1           2  I am a developer by profession     35-44 years old   
2           3  I am a developer by profession     45-54 years old   
3           4           I am learning to code     18-24 years old   
4           5  I am a developer by profession     18-24 years old   

            Employment RemoteWork   Check  \
0  Employed, full-time     Remote  Apples   
1  Employed, full-time     Remote  Apples   
2  Employed, full-time     Remote  Apples   
3   Student, full-time        NaN  Apples   
4   Student, full-time        NaN  Apples   

                                    CodingActivities  \
0                                              Hobby   
1  Hobby;Contribute to open-source projects;Other...   
2  Hobby;Contribute to open-source projects;Other...   
3                                                NaN   
4                                 

### 2. Explore the Dataset


<h5>2.1 Summarize the dataset by displaying the column data types, counts, and missing values.</h5>


In [3]:

# Display data types and non-null counts
print("Column Info:")
print(df.info())

# Display missing values
print("\nMissing Values per Column:")
print(df.isnull().sum())

Column Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65437 entries, 0 to 65436
Columns: 114 entries, ResponseId to JobSat
dtypes: float64(13), int64(1), object(100)
memory usage: 56.9+ MB
None

Missing Values per Column:
ResponseId                 0
MainBranch                 0
Age                        0
Employment                 0
RemoteWork             10631
                       ...  
JobSatPoints_11        35992
SurveyLength            9255
SurveyEase              9199
ConvertedCompYearly    42002
JobSat                 36311
Length: 114, dtype: int64
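
Raw counts such as those above can be easier to compare as a share of all rows. The following is a minimal sketch on a small made-up frame (the `toy` data and the `missing_pct` name are illustrative assumptions, not part of the survey); the same expression should apply to `df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the survey data (values are illustrative only)
toy = pd.DataFrame({
    "ResponseId": [1, 2, 3, 4],
    "RemoteWork": ["Remote", np.nan, np.nan, "In-person"],
    "JobSat": [8.0, np.nan, np.nan, np.nan],
})

# isnull().mean() gives the fraction of missing entries per column;
# multiplying by 100 turns it into a percentage
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

Sorting in descending order puts the sparsest columns first, which helps when deciding whether a column is worth imputing at all.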


<h5>2.2 Generate basic statistics for numerical columns.</h5>


In [4]:
# Display basic statistics for numerical columns
print(df.describe())

         ResponseId      CompTotal       WorkExp  JobSatPoints_1  \
count  65437.000000   3.374000e+04  29658.000000    29324.000000   
mean   32719.000000  2.963841e+145     11.466957       18.581094   
std    18890.179119  5.444117e+147      9.168709       25.966221   
min        1.000000   0.000000e+00      0.000000        0.000000   
25%    16360.000000   6.000000e+04      4.000000        0.000000   
50%    32719.000000   1.100000e+05      9.000000       10.000000   
75%    49078.000000   2.500000e+05     16.000000       22.000000   
max    65437.000000  1.000000e+150     50.000000      100.000000   

       JobSatPoints_4  JobSatPoints_5  JobSatPoints_6  JobSatPoints_7  \
count    29393.000000    29411.000000    29450.000000     29448.00000   
mean         7.522140       10.060857       24.343232        22.96522   
std         18.422661       21.833836       27.089360        27.01774   
min          0.000000        0.000000        0.000000         0.00000   
25%          0.000000 

### 3. Identifying and Removing Inconsistencies


<h5>3.1 Identify inconsistent or irrelevant entries in specific columns (e.g., Country).</h5>


In [5]:
# Report null, whitespace-padded, placeholder, and inconsistently capitalized entries in a column
def identify_entries(df, column_name):

    columns_to_display = ["ResponseId", column_name]

    # Check if there is any null or empty value
    null_or_empty = df[df[column_name].isnull() | (df[column_name]=='')][columns_to_display]

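    # Check if any leading or trailing spaces
    # Note: NaN != NaN evaluates to True, so rows with a missing value are flagged here as well
    leading_trailing_spaces = df[df[column_name] != df[column_name].str.strip()][columns_to_display]

    # Check for any irrelevant entry
    irrelevant_entries = df[df[column_name].isin(["N/A", "None", "Other", "Unknown", "Unspecified"])][columns_to_display]

    # Check for inconsistent capitalization
    # Note: str.title() capitalizes every word, so valid names that contain lowercase words
    # (e.g. "United States of America") are flagged, as are missing values
    inconsistent_capitalization = df[df[column_name] != df[column_name].str.title()][columns_to_display]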

    # Check for numeric entries
    #non_alpha_entries = df[~df[column_name].str.isalpha()]

    # Print the values
    print(f"Null or Empty entries: \n {null_or_empty}")
    print(f"Entries with leading or trailing spaces: \n {leading_trailing_spaces}")
    print(f"Irrelevant Entries: \n {irrelevant_entries}")
    print(f"Entries with inconsistent capitalization: \n {inconsistent_capitalization}")
    #print(f"Non Alphabetic entries: \n {non_alpha_entries}")

identify_entries(df,"Country")

Null or Empty entries: 
        ResponseId Country
43448       43449     NaN
43454       43455     NaN
43459       43460     NaN
43460       43461     NaN
43461       43462     NaN
...           ...     ...
65430       65431     NaN
65432       65433     NaN
65433       65434     NaN
65434       65435     NaN
65436       65437     NaN

[6507 rows x 2 columns]
Entries with leading or trailing spaces: 
        ResponseId Country
43448       43449     NaN
43454       43455     NaN
43459       43460     NaN
43460       43461     NaN
43461       43462     NaN
...           ...     ...
65430       65431     NaN
65432       65433     NaN
65433       65434     NaN
65434       65435     NaN
65436       65437     NaN

[6507 rows x 2 columns]
Irrelevant Entries: 
 Empty DataFrame
Columns: [ResponseId, Country]
Index: []
Entries with inconsistent capitalization: 
        ResponseId                                            Country
0               1                           United States of Ameri
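
The whitespace check in the output above lists the same 6507 rows as the null check because a comparison with `NaN` is always unequal. A minimal sketch of a NaN-safe variant follows, using a made-up `toy` frame; the "starts with a lowercase letter" rule is a hypothetical replacement for the `str.title()` comparison, not the lab's prescribed method.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative only): one padded value, one missing, one lowercase
toy = pd.DataFrame({
    "ResponseId": [1, 2, 3, 4],
    "Country": ["Canada", " Norway", np.nan, "india"],
})

# Restrict the string checks to rows that actually hold a value
valid = toy[toy["Country"].notna()]

# Leading or trailing whitespace, ignoring missing values
has_spaces = valid[valid["Country"] != valid["Country"].str.strip()]

# Hypothetical capitalization rule: flag entries whose first character is lowercase.
# Unlike str.title(), this accepts names such as "United States of America".
lowercase_start = valid[valid["Country"].str.strip().str[:1].str.islower()]

print(has_spaces)
print(lowercase_start)
```

Filtering out missing values first keeps each check focused on one kind of problem, so the reports no longer overlap.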

<h5>3.2 Standardize entries in columns like Country or EdLevel by mapping inconsistent values to a consistent format.</h5>


In [6]:
# Use single quotes inside the f-string; nested double quotes only parse on Python 3.12+
print(f"Unique Values in column 'Country': \n {df['Country'].unique()}\n")
print(f"Unique Values in column 'EdLevel': \n {df['EdLevel'].unique()}\n")

Unique Values in column 'Country': 
 ['United States of America'
 'United Kingdom of Great Britain and Northern Ireland' 'Canada' 'Norway'
 'Uzbekistan' 'Serbia' 'Poland' 'Philippines' 'Bulgaria' 'Switzerland'
 'India' 'Germany' 'Ireland' 'Italy' 'Ukraine' 'Australia' 'Brazil'
 'Japan' 'Austria' 'Iran, Islamic Republic of...' 'France' 'Saudi Arabia'
 'Romania' 'Turkey' 'Nepal' 'Algeria' 'Sweden' 'Netherlands' 'Croatia'
 'Pakistan' 'Czech Republic' 'Republic of North Macedonia' 'Finland'
 'Slovakia' 'Russian Federation' 'Greece' 'Israel' 'Belgium' 'Mexico'
 'United Republic of Tanzania' 'Hungary' 'Argentina' 'Portugal'
 'Sri Lanka' 'Latvia' 'China' 'Singapore' 'Lebanon' 'Spain' 'South Africa'
 'Lithuania' 'Viet Nam' 'Dominican Republic' 'Indonesia' 'Kosovo'
 'Morocco' 'Taiwan' 'Georgia' 'San Marino' 'Tunisia' 'Bangladesh'
 'Nigeria' 'Liechtenstein' 'Denmark' 'Ecuador' 'Malaysia' 'Albania'
 'Azerbaijan' 'Chile' 'Ghana' 'Peru' 'Bolivia' 'Egypt' 'Luxembourg'
 'Montenegro' 'Cyprus' 'Paragua

In [7]:
# Dictionary to correct and standardize country names
country_mapping = {
    "United States of America": "United States",
    "United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
    "Russian Federation": "Russia",
    "Viet Nam": "Vietnam",
    "Iran, Islamic Republic of...": "Iran",
    "Republic of Korea": "South Korea",
    "Democratic People's Republic of Korea": "North Korea",
    "Congo, Republic of the...": "Republic of the Congo",
    "Democratic Republic of the Congo": "DR Congo",
    "Venezuela, Bolivarian Republic of...": "Venezuela",
    "Libyan Arab Jamahiriya": "Libya",
    "Lao People's Democratic Republic": "Laos",
    "Brunei Darussalam": "Brunei",
    "Micronesia, Federated States of...": "Micronesia",
    "Côte d'Ivoire": "Ivory Coast",
    "Hong Kong (S.A.R.)": "Hong Kong"
}

# Apply corrections
df["Country"] = df["Country"].replace(country_mapping)

# Print cleaned Country column
print(df["Country"].value_counts())

Country
United States      11095
Germany             4947
India               4231
United Kingdom      3224
Ukraine             2672
                   ...  
Micronesia             1
Nauru                  1
Chad                   1
Djibouti               1
Solomon Islands        1
Name: count, Length: 183, dtype: int64


In [8]:
# Dictionary to correct and standardize EdLevel column
edLevel_mapping = {
    "Bachelor’s degree (B.A., B.S., B.Eng., etc.)": "Bachelor’s Degree",
    "Master’s degree (M.A., M.S., M.Eng., MBA, etc.)": "Master’s Degree",
    "Some college/university study without earning a degree": "Some Higher Education",
    "Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)": "High School Diploma",
    "Professional degree (JD, MD, Ph.D, Ed.D, etc.)": "Doctorate or Professional Degree",
    "Associate degree (A.A., A.S., etc.)": "Associate Degree",
    "Primary/elementary school": "Primary School",
    "Something else": "Other Education"
}

# Apply the mapping
df["EdLevel"] = df["EdLevel"].replace(edLevel_mapping)

# Print cleaned column
print(df["EdLevel"].value_counts())

EdLevel
Bachelor’s Degree                   24942
Master’s Degree                     15557
Some Higher Education                7651
High School Diploma                  5793
Doctorate or Professional Degree     2970
Associate Degree                     1793
Primary School                       1146
Other Education                       932
Name: count, dtype: int64
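
Both mappings above are applied with `replace`, which leaves any value not found in the dictionary untouched. The sketch below, on a made-up three-element Series, contrasts this with `map`, which turns unmapped values into `NaN`; the `countries` and `mapping` names are illustrative assumptions.

```python
import pandas as pd

countries = pd.Series(["Viet Nam", "Canada", "Russian Federation"])
mapping = {"Viet Nam": "Vietnam", "Russian Federation": "Russia"}

# replace: values without a dictionary entry pass through unchanged
replaced = countries.replace(mapping)

# map: values without a dictionary entry become NaN
mapped = countries.map(mapping)

print(replaced.tolist())       # ['Vietnam', 'Canada', 'Russia']
print(mapped.isna().tolist())  # [False, True, False]
```

For a partial dictionary like `country_mapping`, `replace` is usually the safer choice, since `map` would blank out every country that needs no correction.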


### 4. Encoding Categorical Variables


<h5>4.1 Encode the Employment column using one-hot encoding.</h5>


In [20]:

# One-hot encode the Employment column, splitting by ';'
df_expanded = df['Employment'].str.get_dummies(sep=';')

# Concatenate the one-hot encoded columns back to original dataframe
df = pd.concat([df, df_expanded], axis=1)
df

Unnamed: 0,ResponseId,MainBranch,Age,Employment,RemoteWork,Check,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,...,"student, part-time","employed, full-time","employed, part-time",i prefer not to say,"independent contractor, freelancer, or self-employed","not employed, and not looking for work","not employed, but looking for work",retired,"student, full-time","student, part-time.1"
0,1,I am a developer by profession,Under 18 years old,"employed, full-time",Remote,Apples,Hobby,Primary School,Books / Physical media,,...,0,1,0,0,0,0,0,0,0,0
1,2,I am a developer by profession,35-44 years old,"employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,Bachelor’s Degree,Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,...,0,1,0,0,0,0,0,0,0,0
2,3,I am a developer by profession,45-54 years old,"employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,Master’s Degree,Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,...,0,1,0,0,0,0,0,0,0,0
3,4,I am learning to code,18-24 years old,"student, full-time",,Apples,,Some Higher Education,"Other online resources (e.g., videos, blogs, f...",Stack Overflow;How-to videos;Interactive tutorial,...,0,0,0,0,0,0,0,0,1,0
4,5,I am a developer by profession,18-24 years old,"student, full-time",,Apples,,High School Diploma,"Other online resources (e.g., videos, blogs, f...",Technical documentation;Blogs;Written Tutorial...,...,0,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
65432,65433,I am a developer by profession,18-24 years old,"employed, full-time",Remote,Apples,Hobby;School or academic work,Bachelor’s Degree,"On the job training;School (i.e., University, ...",,...,0,1,0,0,0,0,0,0,0,0
65433,65434,I am a developer by profession,25-34 years old,"employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects,,,,...,0,1,0,0,0,0,0,0,0,0
65434,65435,I am a developer by profession,25-34 years old,"employed, full-time",In-person,Apples,Hobby,Bachelor’s Degree,"Other online resources (e.g., videos, blogs, f...",Technical documentation;Stack Overflow;Social ...,...,0,1,0,0,0,0,0,0,0,0
65435,65436,I am a developer by profession,18-24 years old,"employed, full-time","Hybrid (some remote, some in-person)",Apples,Hobby;Contribute to open-source projects;Profe...,High School Diploma,On the job training;Other online resources (e....,Technical documentation;Blogs;Written Tutorial...,...,0,1,0,0,0,0,0,0,0,0
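
The output above contains a `student, part-time.1` column, which suggests the encoding cell was run more than once and the indicator columns were concatenated twice. A minimal sketch of one way to keep the cell re-runnable is shown below on made-up data; the `Employment_` prefix is an assumption chosen for illustration.

```python
import pandas as pd

# Toy multi-select answers (illustrative only)
toy = pd.DataFrame({
    "Employment": [
        "Employed, full-time",
        "Student, full-time;Employed, part-time",
        "Student, full-time",
    ]
})

# Split multi-select answers on ';' and create one indicator column per option
dummies = toy["Employment"].str.get_dummies(sep=";").add_prefix("Employment_")

# Drop indicator columns left over from an earlier run before concatenating,
# so that re-running the cell does not create duplicate columns
toy = toy.drop(columns=[c for c in dummies.columns if c in toy.columns])
toy = pd.concat([toy, dummies], axis=1)

print(toy.columns.tolist())
```

The prefix also makes the encoded columns easy to select later, for example with `toy.filter(like="Employment_")`.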


### 5. Handling Missing Values


<h5>5.1 Identify columns with the highest number of missing values.</h5>


In [22]:
# Sort columns by descending number of missing values
missing_counts_sorted = df.isnull().sum().sort_values(ascending=False)

# Display the columns ordered from most to fewest missing values
print(missing_counts_sorted)


AINextMuch less integrated                                                                                                    64289
AINextLess integrated                                                                                                         63082
AINextNo change                                                                                                               52939
AINextMuch more integrated                                                                                                    51999
EmbeddedAdmired                                                                                                               48704
                                                                                                                              ...  
Employment_student, full-time;not employed, but looking for work;student, part-time                                               0
Employment_student, full-time;not employed, but looking for work;retired    

<h5>5.2 Impute missing values in numerical columns (e.g., `ConvertedCompYearly`) with the mean or median.</h5>


In [33]:
# Select numerical columns
num_cols = df.select_dtypes(include=['number']).columns

# Impute missing values in numerical columns with the column mean.
# Assigning the result back avoids the chained-assignment warning raised by
# df[col].fillna(..., inplace=True), a pattern that stops working in pandas 3.0.
for col in num_cols:
    df[col] = df[col].fillna(df[col].mean())
df[num_cols]


Unnamed: 0,ResponseId,CompTotal,WorkExp,JobSatPoints_1,JobSatPoints_4,JobSatPoints_5,JobSatPoints_6,JobSatPoints_7,JobSatPoints_8,JobSatPoints_9,...,"not employed, and not looking for work","not employed, and not looking for work.1","not employed, but looking for work","not employed, but looking for work.1",retired,retired.1,"student, full-time","student, full-time.1","student, part-time","student, part-time.1"
0,1,2.963841e+145,11.466957,18.581094,7.52214,10.060857,24.343232,22.96522,20.278165,16.169432,...,0,0,0,0,0,0,0,0,0,0
1,2,2.963841e+145,17.000000,0.000000,0.00000,0.000000,0.000000,0.00000,0.000000,0.000000,...,0,0,0,0,0,0,0,0,0,0
2,3,2.963841e+145,11.466957,18.581094,7.52214,10.060857,24.343232,22.96522,20.278165,16.169432,...,0,0,0,0,0,0,0,0,0,0
3,4,2.963841e+145,11.466957,18.581094,7.52214,10.060857,24.343232,22.96522,20.278165,16.169432,...,0,0,0,0,0,0,1,1,0,0
4,5,2.963841e+145,11.466957,18.581094,7.52214,10.060857,24.343232,22.96522,20.278165,16.169432,...,0,0,0,0,0,0,1,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
65432,65433,2.963841e+145,11.466957,18.581094,7.52214,10.060857,24.343232,22.96522,20.278165,16.169432,...,0,0,0,0,0,0,0,0,0,0
65433,65434,2.963841e+145,11.466957,18.581094,7.52214,10.060857,24.343232,22.96522,20.278165,16.169432,...,0,0,0,0,0,0,0,0,0,0
65434,65435,2.963841e+145,11.466957,18.581094,7.52214,10.060857,24.343232,22.96522,20.278165,16.169432,...,0,0,0,0,0,0,0,0,0,0
65435,65436,2.963841e+145,5.000000,0.000000,0.00000,0.000000,0.000000,0.00000,0.000000,0.000000,...,0,0,0,0,0,0,0,0,0,0


In [35]:
# Recheck for missing values in the numeric columns after imputation
df[num_cols].isnull().sum()

ResponseId                                              0
CompTotal                                               0
WorkExp                                                 0
JobSatPoints_1                                          0
JobSatPoints_4                                          0
JobSatPoints_5                                          0
JobSatPoints_6                                          0
JobSatPoints_7                                          0
JobSatPoints_8                                          0
JobSatPoints_9                                          0
JobSatPoints_10                                         0
JobSatPoints_11                                         0
ConvertedCompYearly                                     0
JobSat                                                  0
employed, full-time                                     0
employed, full-time                                     0
employed, part-time                                     0
employed, part
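
The statistics in section 2.2 report a `CompTotal` mean of about 2.96e+145 against a median of 1.1e+05, so mean imputation fills every gap with a value driven by a few extreme entries. The sketch below compares the two strategies on a made-up Series; the numbers are illustrative, and whether the median is the better choice for the real data is a judgment call.

```python
import numpy as np
import pandas as pd

# Toy compensation values with one extreme outlier and one missing entry
comp = pd.Series([60_000, 110_000, 250_000, 1e150, np.nan], name="CompTotal")

# Mean imputation: the outlier dominates the fill value
mean_filled = comp.fillna(comp.mean())

# Median imputation: the fill value stays close to typical entries
median_filled = comp.fillna(comp.median())

print(mean_filled.iloc[-1])
print(median_filled.iloc[-1])
```

The median is far less sensitive to outliers, which is why it is often preferred for heavily skewed columns such as compensation.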

<h5>5.3 Impute missing values in categorical columns (e.g., `RemoteWork`) with the most frequent value.</h5>


In [37]:
# Select the categorical columns (object or category dtype)
categorical_columns = df.select_dtypes(include=['object','category']).columns

categorical_columns

Index(['MainBranch', 'Age', 'Employment', 'RemoteWork', 'Check',
       'CodingActivities', 'EdLevel', 'LearnCode', 'LearnCodeOnline',
       'TechDoc', 'YearsCode', 'YearsCodePro', 'DevType', 'OrgSize',
       'PurchaseInfluence', 'BuyNewTool', 'BuildvsBuy', 'TechEndorse',
       'Country', 'Currency', 'LanguageHaveWorkedWith',
       'LanguageWantToWorkWith', 'LanguageAdmired', 'DatabaseHaveWorkedWith',
       'DatabaseWantToWorkWith', 'DatabaseAdmired', 'PlatformHaveWorkedWith',
       'PlatformWantToWorkWith', 'PlatformAdmired', 'WebframeHaveWorkedWith',
       'WebframeWantToWorkWith', 'WebframeAdmired', 'EmbeddedHaveWorkedWith',
       'EmbeddedWantToWorkWith', 'EmbeddedAdmired', 'MiscTechHaveWorkedWith',
       'MiscTechWantToWorkWith', 'MiscTechAdmired', 'ToolsTechHaveWorkedWith',
       'ToolsTechWantToWorkWith', 'ToolsTechAdmired',
       'NEWCollabToolsHaveWorkedWith', 'NEWCollabToolsWantToWorkWith',
       'NEWCollabToolsAdmired', 'OpSysPersonal use', 'OpSysProfessional use

In [38]:
# Fill missing values in each categorical column with its most frequent value (mode)
for col in categorical_columns:
    most_freq_val = df[col].mode()[0]
    df[col] = df[col].fillna(most_freq_val)

df[categorical_columns].head()

Unnamed: 0,MainBranch,Age,Employment,RemoteWork,Check,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,TechDoc,...,Frequency_3,TimeSearching,TimeAnswering,Frustration,ProfessionalTech,ProfessionalCloud,ProfessionalQuestion,Industry,SurveyLength,SurveyEase
0,I am a developer by profession,Under 18 years old,"employed, full-time",Remote,Apples,Hobby,Primary School,Books / Physical media,Technical documentation;Blogs;Written Tutorial...,API document(s) and/or SDK document(s);User gu...,...,1-2 times a week,30-60 minutes a day,15-30 minutes a day,None of these,None of these,Hybrid (on-prem and cloud),Traditional public search engine,Software Development,Appropriate in length,Easy
1,I am a developer by profession,35-44 years old,"employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,Bachelor’s Degree,Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,API document(s) and/or SDK document(s);User gu...,...,1-2 times a week,30-60 minutes a day,15-30 minutes a day,None of these,None of these,Hybrid (on-prem and cloud),Traditional public search engine,Software Development,Appropriate in length,Easy
2,I am a developer by profession,45-54 years old,"employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,Master’s Degree,Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,API document(s) and/or SDK document(s);User gu...,...,1-2 times a week,30-60 minutes a day,15-30 minutes a day,None of these,None of these,Hybrid (on-prem and cloud),Traditional public search engine,Software Development,Appropriate in length,Easy
3,I am learning to code,18-24 years old,"student, full-time","Hybrid (some remote, some in-person)",Apples,Hobby,Some Higher Education,"Other online resources (e.g., videos, blogs, f...",Stack Overflow;How-to videos;Interactive tutorial,API document(s) and/or SDK document(s);User gu...,...,1-2 times a week,30-60 minutes a day,15-30 minutes a day,None of these,None of these,Hybrid (on-prem and cloud),Traditional public search engine,Software Development,Too long,Easy
4,I am a developer by profession,18-24 years old,"student, full-time","Hybrid (some remote, some in-person)",Apples,Hobby,High School Diploma,"Other online resources (e.g., videos, blogs, f...",Technical documentation;Blogs;Written Tutorial...,API document(s) and/or SDK document(s);User gu...,...,1-2 times a week,30-60 minutes a day,15-30 minutes a day,None of these,None of these,Hybrid (on-prem and cloud),Traditional public search engine,Software Development,Too short,Easy


In [39]:
# Recheck for missing values after imputation
df[categorical_columns].isnull().sum()

MainBranch              0
Age                     0
Employment              0
RemoteWork              0
Check                   0
                       ..
ProfessionalCloud       0
ProfessionalQuestion    0
Industry                0
SurveyLength            0
SurveyEase              0
Length: 100, dtype: int64
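
Mode imputation relies on `mode()` returning at least one value, which is not the case for a column that is entirely missing. A minimal sketch of a guarded loop follows, on made-up data; the `AllMissing` column is a hypothetical edge case and does not come from the survey.

```python
import numpy as np
import pandas as pd

# Toy frame: one ordinary column and one hypothetical all-missing column
toy = pd.DataFrame({
    "RemoteWork": ["Remote", "Remote", np.nan, "In-person"],
    "AllMissing": [np.nan, np.nan, np.nan, np.nan],
})
toy["AllMissing"] = toy["AllMissing"].astype(object)

for col in toy.columns:
    modes = toy[col].mode()  # NaN is ignored; the result is empty if no values exist
    if modes.empty:
        continue             # nothing to impute with, leave the column as it is
    toy[col] = toy[col].fillna(modes.iloc[0])

print(toy)
```

When several values tie for most frequent, `mode()` returns all of them in sorted order, so taking the first element is a deterministic but arbitrary choice.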

### 6. Feature Scaling and Transformation


<h5>6.1 Apply Min-Max Scaling to normalize the `ConvertedCompYearly` column.</h5>


In [40]:
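col = 'ConvertedCompYearly'

# Min-max scaling maps the column onto [0, 1]: (x - min) / (max - min)
# Note: this overwrites the original values in the column
df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())
df["ConvertedCompYearly"].head()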

0    0.0053
1    0.0053
2    0.0053
3    0.0053
4    0.0053
Name: ConvertedCompYearly, dtype: float64
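
The cell above replaces the original compensation values with their scaled versions. A minimal sketch of the same formula written to a separate column is shown below on made-up numbers; the `_MinMax` suffix is a naming assumption, chosen so that the raw values remain available for later steps.

```python
import pandas as pd

# Toy compensation values (illustrative only)
toy = pd.DataFrame({"ConvertedCompYearly": [20_000.0, 50_000.0, 120_000.0]})

col = "ConvertedCompYearly"
col_min, col_max = toy[col].min(), toy[col].max()

# (x - min) / (max - min): the smallest value maps to 0 and the largest to 1
toy[f"{col}_MinMax"] = (toy[col] - col_min) / (col_max - col_min)

print(toy)
```

Keeping the scaled values in their own column means other transformations can still start from the unscaled data.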

<h5>6.2 Log-transform the ConvertedCompYearly column to reduce skewness.</h5>


In [43]:
import numpy as np

# log1p computes log(1 + x), which is defined for x = 0
df["ConvertedCompYearly_log1p"] = np.log1p(df["ConvertedCompYearly"])

df["ConvertedCompYearly_log1p"].head()


0    0.005286
1    0.005286
2    0.005286
3    0.005286
4    0.005286
Name: ConvertedCompYearly_log1p, dtype: float64
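
Because section 6.1 overwrote `ConvertedCompYearly` with values in [0, 1], the log transform above changes them very little (0.0053 becomes 0.005286). The sketch below uses made-up compensation figures to compare skewness when `log1p` is applied to raw values and to min-max scaled values; it illustrates the effect and is not a result from the survey data.

```python
import numpy as np
import pandas as pd

# Toy right-skewed compensation values (illustrative only)
raw = pd.Series([20_000.0, 45_000.0, 60_000.0, 85_000.0, 2_000_000.0])

# The same values after min-max scaling to [0, 1]
scaled = (raw - raw.min()) / (raw.max() - raw.min())

# Sample skewness before and after the log transform
print("raw:           ", raw.skew())
print("log1p(raw):    ", np.log1p(raw).skew())
print("log1p(scaled): ", np.log1p(scaled).skew())
```

On values close to zero, `log1p(x)` is approximately `x`, so the transform only compresses the long tail when it is applied to the unscaled column.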

### 7. Feature Engineering


<h5>7.1 Create a new column `ExperienceLevel` based on the `YearsCodePro` column:</h5>


In [45]:
# Replace text values with numerical equivalents
df["YearsCodePro"] = df["YearsCodePro"].replace({
    "Less than 1 year": 0,
    "More than 50 years": 51
}).astype(float)  # Convert to float to handle NaNs

# Function to assign Experience Level based on YearsCodePro
def assign_experience_level(years):
    if pd.isna(years):  # Handle missing or non-numeric values
        return 'Unknown'
    elif years <= 2:
        return 'Beginner'
    elif years <= 5:  # covers everything above 2 and up to 5, including non-integer values
        return 'Intermediate'
    else:
        return 'Advanced'

# Apply the function to create the 'ExperienceLevel' column
df['ExperienceLevel'] = df['YearsCodePro'].apply(assign_experience_level)

# Print the updated DataFrame
print(df[['YearsCodePro', 'ExperienceLevel']])

       YearsCodePro ExperienceLevel
0               2.0        Beginner
1              17.0        Advanced
2              27.0        Advanced
3               2.0        Beginner
4               2.0        Beginner
...             ...             ...
65432           3.0    Intermediate
65433           2.0        Beginner
65434           5.0    Intermediate
65435           2.0        Beginner
65436           2.0        Beginner

[65437 rows x 2 columns]
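
The same binning can also be expressed without a row-wise function. The following is a minimal sketch using `pd.cut` on a made-up Series, assuming the same thresholds as `assign_experience_level` (up to 2 years, up to 5 years, more than 5 years).

```python
import numpy as np
import pandas as pd

# Toy years of professional experience, including a missing value
years = pd.Series([0, 2, 3, 5, 6, 51, np.nan], name="YearsCodePro")

# Right-inclusive bins: (-inf, 2], (2, 5], (5, inf)
levels = pd.cut(
    years,
    bins=[-np.inf, 2, 5, np.inf],
    labels=["Beginner", "Intermediate", "Advanced"],
)

# Missing values stay NaN after pd.cut; label them explicitly
experience_level = levels.cat.add_categories("Unknown").fillna("Unknown")

print(experience_level.tolist())
```

`pd.cut` is vectorized and returns a categorical column, which can be convenient for later grouping or plotting.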


