Dataset: "SLU_Opportunity_Wise_Data" (csv file) [Raw Dataset]

Importing Necessary Libraries

In [16]:
import pandas as pd
import numpy as np

Reading the CSV file

In [17]:
file_path = "SLU_Opportunity_Wise_Data.csv"  
df = pd.read_csv(file_path)

Checking the structure of the data (columns, non-null counts, dtypes)

In [18]:
df.info()  # info() prints its summary directly and returns None, so no print() is needed

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8558 entries, 0 to 8557
Data columns (total 16 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Learner SignUp DateTime  8558 non-null   object
 1   Opportunity Id           8558 non-null   object
 2   Opportunity Name         8558 non-null   object
 3   Opportunity Category     8558 non-null   object
 4   Opportunity End Date     8558 non-null   object
 5   First Name               8558 non-null   object
 6   Date of Birth            8558 non-null   object
 7   Gender                   8558 non-null   object
 8   Country                  8558 non-null   object
 9   Institution Name         8553 non-null   object
 10  Current/Intended Major   8553 non-null   object
 11  Entry created at         8558 non-null   object
 12  Status Description       8558 non-null   object
 13  Status Code              8558 non-null   int64 
 14  Apply Date               8558 non-null  

1. Handling Missing & NULL Values

Checking for NULL values

In [19]:
print("Missing values per column:\n", df.isnull().sum())

Missing values per column:
 Learner SignUp DateTime       0
Opportunity Id                0
Opportunity Name              0
Opportunity Category          0
Opportunity End Date          0
First Name                    0
Date of Birth                 0
Gender                        0
Country                       0
Institution Name              5
Current/Intended Major        5
Entry created at              0
Status Description            0
Status Code                   0
Apply Date                    0
Opportunity Start Date     3794
dtype: int64
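
Raw counts are easier to judge as a share of all rows: 3794 missing start dates out of 8558 entries is roughly 44% of the column. The sketch below shows one way to put counts and percentages side by side. It runs on a small made-up frame (`sample` and its values are illustrative, not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
sample = pd.DataFrame({
    "Institution Name": ["Saint Louis University", np.nan, "Nwihs", "Slu"],
    "Opportunity Start Date": ["2022-11-03", np.nan, np.nan, "2023-01-06"],
})

# Count and percentage of missing values per column, in one table.
missing_summary = pd.DataFrame({
    "missing": sample.isna().sum(),
    "percent": (sample.isna().mean() * 100).round(1),
})
print(missing_summary)
```

On the real data, the same two expressions can be applied to `df` instead of `sample`.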


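Re-checking for missing values (this cell ran after the date columns were converted with errors='coerce', so unparseable dates now count as missing)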

In [38]:
# Count missing values in all columns
missing_values = df.isna().sum()
print("Missing values per column:\n", missing_values)

Missing values per column:
 Learner SignUp DateTime      19
Opportunity Id                0
Opportunity Name              0
Opportunity Category          0
Opportunity End Date       1216
First Name                    0
Date of Birth                 0
Gender                        0
Country                       0
Institution Name              0
Current/Intended Major        0
Entry created at              0
Status Description            0
Status Code                   0
Apply Date                  245
Opportunity Start Date      811
dtype: int64


Identifying missing values in the Apply Date column

In [39]:
# Count the number of missing values
missing_apply_date = df['Apply Date'].isna().sum()
print(f"Number of missing values in Apply Date: {missing_apply_date}")

Number of missing values in Apply Date: 245


Handling missing values in the Apply Date column

In [40]:
# Fill missing values with the current timestamp
df['Apply Date'] = df['Apply Date'].fillna(pd.Timestamp('today'))

Identifying missing values in the Opportunity Start Date column

In [41]:
# Count the number of missing values
missing_start_date = df['Opportunity Start Date'].isna().sum()
print(f"Number of missing values in Opportunity Start Date: {missing_start_date}")

Number of missing values in Opportunity Start Date: 811


Handling missing values in the Opportunity Start Date column

In [42]:
# Fill missing values with the current timestamp
df['Opportunity Start Date'] = df['Opportunity Start Date'].fillna(pd.Timestamp('today'))

Identifying missing values in the Opportunity End Date column

In [43]:
# Count the number of missing values
missing_end_date = df['Opportunity End Date'].isna().sum()
print(f"Number of missing values in Opportunity End Date: {missing_end_date}")

Number of missing values in Opportunity End Date: 1216


Handling missing values in the Opportunity End Date column

In [44]:
# Fill missing values with the current timestamp
df['Opportunity End Date'] = df['Opportunity End Date'].fillna(pd.Timestamp('today'))

Identifying missing values in the Learner SignUp DateTime column

In [45]:
# Count the number of missing values
missing_signup_date = df['Learner SignUp DateTime'].isna().sum()
print(f"Number of missing values in Learner SignUp DateTime: {missing_signup_date}")

Number of missing values in Learner SignUp DateTime: 19


Handling missing values in the Learner SignUp DateTime column

In [46]:
# Fill missing values with the current timestamp
df['Learner SignUp DateTime'] = df['Learner SignUp DateTime'].fillna(pd.Timestamp('today'))
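
The four date columns above are filled one cell at a time with the same value. As a rough sketch, the same step can be written as a loop that also records which rows were imputed, so that placeholder dates can be told apart from real ones later. The `"... Imputed"` flag columns and the toy frame are assumptions made for illustration; they are not part of the original dataset.

```python
import pandas as pd

# Toy frame with a few missing dates (values are made up).
sample = pd.DataFrame({
    "Apply Date": pd.to_datetime(["2023-06-14", None, "2023-01-06"]),
    "Opportunity End Date": pd.to_datetime([None, "2024-06-29", "2024-06-29"]),
})

# Same fill value as in the cells above, truncated to midnight.
fill_value = pd.Timestamp("today").normalize()

for col in ["Apply Date", "Opportunity End Date"]:
    # Hypothetical helper column: remember which rows were imputed.
    sample[f"{col} Imputed"] = sample[col].isna()
    sample[col] = sample[col].fillna(fill_value)

print(sample)
```

Keeping the flag makes it possible to exclude imputed rows from any calculation that depends on the actual date, such as the time between sign-up and application.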

Filling missing values with appropriate defaults:
'Unknown' for missing Institution Name and Current/Intended Major, and the current timestamp for missing Opportunity Start Date

In [21]:
df['Institution Name'] = df['Institution Name'].fillna('Unknown')
df['Current/Intended Major'] = df['Current/Intended Major'].fillna('Unknown')
df['Opportunity Start Date'] = df['Opportunity Start Date'].fillna(pd.Timestamp('today'))

Verifying missing values are handled

In [47]:
print(df.isnull().sum())

Learner SignUp DateTime    0
Opportunity Id             0
Opportunity Name           0
Opportunity Category       0
Opportunity End Date       0
First Name                 0
Date of Birth              0
Gender                     0
Country                    0
Institution Name           0
Current/Intended Major     0
Entry created at           0
Status Description         0
Status Code                0
Apply Date                 0
Opportunity Start Date     0
dtype: int64


Verifying that no missing values remain in any column

In [48]:
# Count missing values in all columns
missing_values = df.isna().sum()
print("Missing values per column:\n", missing_values)


Missing values per column:
 Learner SignUp DateTime    0
Opportunity Id             0
Opportunity Name           0
Opportunity Category       0
Opportunity End Date       0
First Name                 0
Date of Birth              0
Gender                     0
Country                    0
Institution Name           0
Current/Intended Major     0
Entry created at           0
Status Description         0
Status Code                0
Apply Date                 0
Opportunity Start Date     0
dtype: int64


2. Handling Outliers

Step 1: Identifying Outliers

In [23]:
# List of date columns
date_cols = ['Learner SignUp DateTime', 'Opportunity Start Date', 'Apply Date', 
             'Opportunity End Date', 'Entry created at']

# Convert to datetime
for col in date_cols:
    df[col] = pd.to_datetime(df[col], errors='coerce')  

# ------------------------------
# Identify numeric outliers (Status Code) 
# ------------------------------
Q1 = df['Status Code'].quantile(0.25)
Q3 = df['Status Code'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

status_outliers = df[(df['Status Code'] < lower_bound) | (df['Status Code'] > upper_bound)]
print("Status Code Outliers:\n", status_outliers)

# ------------------------------
# Identify date/time outliers
# ------------------------------
min_date = pd.Timestamp('2000-01-01')
max_date = pd.Timestamp('today')

for col in date_cols:
    date_outliers = df[(df[col] < min_date) | (df[col] > max_date)]
    print(f"Number of outliers in {col}: {len(date_outliers)}")
    if not date_outliers.empty:
        print(date_outliers[[col]])

Status Code Outliers:
 Empty DataFrame
Columns: [Learner SignUp DateTime, Opportunity Id, Opportunity Name, Opportunity Category, Opportunity End Date, First Name, Date of Birth, Gender, Country, Institution Name, Current/Intended Major, Entry created at, Status Description, Status Code, Apply Date, Opportunity Start Date]
Index: []
Number of outliers in Learner SignUp DateTime: 0
Number of outliers in Opportunity Start Date: 0
Number of outliers in Apply Date: 0
Number of outliers in Opportunity End Date: 355
     Opportunity End Date
5440  2025-12-05 11:06:00
5441  2025-12-05 11:06:00
5442  2025-12-05 11:06:00
5443  2025-12-05 11:06:00
5444  2025-12-05 11:06:00
...                   ...
8316  2025-12-24 03:34:00
8317  2025-12-24 03:34:00
8318  2025-12-24 03:34:00
8319  2025-12-24 03:34:00
8320  2025-12-24 03:34:00

[355 rows x 1 columns]
Number of outliers in Entry created at: 0
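
The IQR rule used above can be wrapped in a small helper so the bounds are computed in one place. This is a minimal sketch on made-up status codes; the function name `iqr_bounds` and the value 1500 are hypothetical and only serve to show a value that the rule would flag.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) bounds using the k * IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up codes; 1500 is an artificial extreme value.
codes = pd.Series([1010, 1030, 1030, 1050, 1070, 1070, 1120, 1500])

lower, upper = iqr_bounds(codes)
outliers = codes[(codes < lower) | (codes > upper)]
print(lower, upper)
print(outliers)
```

With the quartiles reported for the real column (1030 and 1070), the same rule gives bounds of 970 and 1130, which is why no Status Code value is flagged.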


Step 2: Handling Outliers

In [24]:
# Handle numeric outliers: Status Code

Q1 = df['Status Code'].quantile(0.25)
Q3 = df['Status Code'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

df['Status Code'] = df['Status Code'].clip(lower=lower_bound, upper=upper_bound)

# Handle date/time outliers
date_cols = ['Learner SignUp DateTime', 'Opportunity Start Date', 'Apply Date', 
             'Opportunity End Date', 'Entry created at']

min_date = pd.Timestamp('2000-01-01')
max_date = pd.Timestamp('today')

for col in date_cols:
    df[col] = df[col].clip(lower=min_date, upper=max_date)

Verifying outliers are handled

In [25]:
# Check for numeric outliers again
status_outliers = df[(df['Status Code'] < lower_bound) | (df['Status Code'] > upper_bound)]
print("Status Code Outliers after handling:\n", status_outliers)

# Check for date outliers again
for col in date_cols:
    date_outliers = df[(df[col] < min_date) | (df[col] > max_date)]
    print(f"Number of outliers in {col} after handling: {len(date_outliers)}")

Status Code Outliers after handling:
 Empty DataFrame
Columns: [Learner SignUp DateTime, Opportunity Id, Opportunity Name, Opportunity Category, Opportunity End Date, First Name, Date of Birth, Gender, Country, Institution Name, Current/Intended Major, Entry created at, Status Description, Status Code, Apply Date, Opportunity Start Date]
Index: []
Number of outliers in Learner SignUp DateTime after handling: 0
Number of outliers in Opportunity Start Date after handling: 0
Number of outliers in Apply Date after handling: 0
Number of outliers in Opportunity End Date after handling: 0
Number of outliers in Entry created at after handling: 0
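
Clipping replaces every out-of-range date with the nearest bound, so it changes values rather than removing rows. The sketch below illustrates this on three made-up dates and counts how many were altered. A fixed `max_date` stands in for `pd.Timestamp('today')` so that the result does not depend on when the code is run.

```python
import pandas as pd

# Made-up dates: one too early, one in range, one too late.
end_dates = pd.Series(pd.to_datetime(["1999-05-01", "2024-06-29", "2030-01-01"]))

min_date = pd.Timestamp("2000-01-01")
max_date = pd.Timestamp("2025-01-01")  # fixed stand-in for pd.Timestamp('today')

clipped = end_dates.clip(lower=min_date, upper=max_date)

# Number of values that were moved to a bound.
changed = int((clipped != end_dates).sum())
print(clipped)
print(changed)
```

Comparing the clipped series with the original, as done for `changed`, is a simple way to report how many records a clipping step touched.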


3. Standardizing formats

Checking whether the date/time columns need standardizing

In [26]:
# List of date columns
date_cols = ['Learner SignUp DateTime', 'Opportunity Start Date', 'Apply Date', 
             'Opportunity End Date', 'Entry created at']

# Check current formats
for col in date_cols:
    print(f"Column: {col}")
    print(df[col].head())
    print(df[col].dtype)
    print("-"*50)


Column: Learner SignUp DateTime
0   2023-06-14 12:30:35
1   2023-05-01 05:29:16
2   2023-04-09 20:35:08
3   2023-08-29 05:20:03
4   2023-01-06 15:26:36
Name: Learner SignUp DateTime, dtype: datetime64[ns]
datetime64[ns]
--------------------------------------------------
Column: Opportunity Start Date
0   2022-11-03 18:30:39
1   2022-11-03 18:30:39
2   2022-11-03 18:30:39
3   2022-11-03 18:30:39
4   2022-11-03 18:30:39
Name: Opportunity Start Date, dtype: datetime64[ns]
datetime64[ns]
--------------------------------------------------
Column: Apply Date
0   2023-06-14 12:36:09
1   2023-05-01 06:08:21
2                   NaT
3   2023-10-09 22:02:42
4   2023-01-06 15:40:10
Name: Apply Date, dtype: datetime64[ns]
datetime64[ns]
--------------------------------------------------
Column: Opportunity End Date
0   2024-06-29 18:52:39
1   2024-06-29 18:52:39
2   2024-06-29 18:52:39
3   2024-06-29 18:52:39
4   2024-06-29 18:52:39
Name: Opportunity End Date, dtype: datetime64[ns]
datetime64[ns]
-

The date columns (Learner SignUp DateTime, Opportunity Start Date, Apply Date, Opportunity End Date, and Entry created at) are all stored as datetime64[ns], because they were converted with pd.to_datetime in the outlier step. No further format standardization is needed, and date-based analysis and comparisons work directly. Entries that could not be parsed appear as NaT (see row 2 of Apply Date above) and are treated as missing values.
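
Instead of reading the printed dtypes by eye, the check can be done programmatically. The following sketch uses `pandas.api.types.is_datetime64_any_dtype` on a toy frame in which one column is deliberately left as strings; the frame and its values are made up.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Toy frame: one parsed column, one column still holding plain strings.
sample = pd.DataFrame({
    "Apply Date": pd.to_datetime(["2023-06-14 12:36:09", None]),
    "Entry created at": ["2023-06-14", "not a date"],
})

# Columns that are not yet stored as datetimes.
not_datetime = [c for c in sample.columns if not is_datetime64_any_dtype(sample[c])]
print(not_datetime)
```

An empty list would confirm that every listed column already has a datetime dtype.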

Checking whether the categorical/text columns need standardizing

In [27]:
# List of categorical columns
cat_cols = ['Opportunity Name', 'Opportunity Category', 'First Name', 
            'Gender', 'Country', 'Institution Name', 'Current/Intended Major', 
            'Status Description']

# Inspect unique values and sample data
for col in cat_cols:
    print(f"Column: {col}")
    print("Unique values (sample 10):", df[col].unique()[:10])
    print(df[col].head())
    print("-"*50)

Column: Opportunity Name
Unique values (sample 10): ['Career Essentials: Getting Started with Your Professional Journey'
 'Slide Geeks: A Presentation Design Competition' 'Digital Marketing'
 'Health Care Management' 'Innovation & Entrepreneurship'
 'Project Management' 'Data Visualization' 'CPR/AED Certification'
 'Mental and Physical Health Session'
 'Jump Start: Developing your Emotional Intelligence']
0    Career Essentials: Getting Started with Your P...
1    Career Essentials: Getting Started with Your P...
2    Career Essentials: Getting Started with Your P...
3    Career Essentials: Getting Started with Your P...
4    Career Essentials: Getting Started with Your P...
Name: Opportunity Name, dtype: object
--------------------------------------------------
Column: Opportunity Category
Unique values (sample 10): ['Course' 'Competition' 'Internship' 'Event' 'Engagement']
0    Course
1    Course
2    Course
3    Course
4    Course
Name: Opportunity Category, dtype: object
----------

The categorical columns (Opportunity Name, Opportunity Category, First Name, Gender, Country, Institution Name, Current/Intended Major, and Status Description) show minor inconsistencies in capitalization and spacing. Standardizing them by stripping extra spaces, applying title case, and harmonizing values such as Gender makes the records uniform. This improves readability, prevents the same value from being counted under several spellings, and supports accurate grouping, filtering, and reporting.

Based on this output, the columns that need standardization are mainly categorical/text columns because:

First Name – some names are fully uppercase (SIDDHARTH) while others are proper case.

Gender – values like "Don't want to specify" should be mapped to one consistent label (here, "Prefer Not To Say").

Country – might have inconsistent capitalization or extra spaces.

Institution Name – inconsistent capitalization (SAINT LOUIS vs federal university Lokoja).

Current/Intended Major – capitalization differences (Computer Science vs computer science).

Opportunity Name / Opportunity Category / Status Description – mainly need stripping spaces and consistent capitalization.

Standardizing formats of All Categorical Columns

In [28]:
# List of categorical columns to standardize
cat_cols = ['Opportunity Name', 'Opportunity Category', 'First Name', 
            'Gender', 'Country', 'Institution Name', 'Current/Intended Major', 
            'Status Description']

# Standardize text: strip spaces and convert to title case
for col in cat_cols:
    df[col] = df[col].astype(str).str.strip().str.title()

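# Standardize Gender values explicitly.
# Note: str.title() capitalizes the letter after an apostrophe, so
# "Don't want to specify" has already become "Don'T Want To Specify".
df['Gender'] = df['Gender'].replace({
    'M': 'Male',
    'F': 'Female',
    "Don'T Want To Specify": "Prefer Not To Say"
})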

# checking first few rows after standardization
df[cat_cols].head()


   Opportunity Name                                   Opportunity Category  First Name        Gender  Country        Institution Name                  Current/Intended Major  Status Description
0  Career Essentials: Getting Started With Your P...  Course                Faria             Female  Pakistan       Nwihs                             Radiology               Started
1  Career Essentials: Getting Started With Your P...  Course                Poojitha          Female  India          Saint Louis                       Information Systems     Started
2  Career Essentials: Getting Started With Your P...  Course                Emmanuel          Male    United States  Illinois Institute Of Technology  Computer Science        Started
3  Career Essentials: Getting Started With Your P...  Course                Amrutha Varshini  Female  United States  Saint Louis University            Information Systems     Team Allocated
4  Career Essentials: Getting Started With Your P...  Course                Vinay Varshith    Male    United States  Saint Louis University            Computer Science        Started
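
Title-casing has side effects that matter when writing mapping keys afterwards: Python's `str.title()` capitalizes the first letter after any non-letter character (including an apostrophe) and lowercases the rest of each word. The short example below shows the effect on a Gender value and on an acronym. The `gender_map` dictionary is a reduced, illustrative version of the mapping used in this notebook.

```python
# How str.title() treats apostrophes and acronyms.
raw = "Don't want to specify"
titled = raw.strip().title()
print(titled)                # Don'T Want To Specify
print("IIT Delhi".title())   # Iit Delhi

# A mapping key must match the title-cased spelling to have any effect.
gender_map = {"Don'T Want To Specify": "Prefer Not To Say"}
cleaned = gender_map.get(titled, titled)
print(cleaned)               # Prefer Not To Say
```

The pandas method `Series.str.title()` behaves the same way, so any replacement dictionary applied after title-casing needs keys such as "Don'T Want To Specify" or "Iit Delhi".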


Checking whether the numeric columns need standardizing

In [29]:
# Numeric columns
num_cols = ['Status Code']

for col in num_cols:
    print(f"Column: {col}")
    print(df[col].describe())
    print(df[col].dtype)
    print("-"*50)

Column: Status Code
count    8558.000000
mean     1052.225987
std        21.665207
min      1010.000000
25%      1030.000000
50%      1050.000000
75%      1070.000000
max      1120.000000
Name: Status Code, dtype: float64
int64
--------------------------------------------------


The Status Code column is already clean and consistent. All values are integers (int64) between 1010 and 1120, and the IQR check above found no outliers. As a result, no further standardization is required, and the column is ready for analysis or any downstream processing.
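
Because Status Code is a coded category rather than a measurement, a complementary check is whether each code always carries the same Status Description. The sketch below is one way to test that. The pairing of codes and descriptions in the toy frame is invented for illustration and does not reflect the real code list.

```python
import pandas as pd

# Made-up pairs; code 1070 deliberately appears with two descriptions.
sample = pd.DataFrame({
    "Status Code": [1030, 1030, 1070, 1070],
    "Status Description": ["Started", "Started", "Team Allocated", "Started"],
})

# Number of distinct descriptions seen for each code.
descriptions_per_code = sample.groupby("Status Code")["Status Description"].nunique()

# Codes that map to more than one description.
ambiguous_codes = descriptions_per_code[descriptions_per_code > 1].index.tolist()
print(ambiguous_codes)
```

An empty result on the real data would indicate a one-to-one relationship between the two columns.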

4. Correcting Errors

The dataset contains occasional typographical errors, inconsistencies, and inaccurate entries that can affect analysis. For example, institutions like “st. louis” and “Saint Louis-” refer to the same entity but appear differently, and some users have entered invalid majors such as xxxhhyy. Correcting these errors involves standardizing institution names, harmonizing categorical entries like Gender, and flagging or replacing invalid or unknown values in majors. These corrections ensure data consistency, reduce inaccuracies, and improve the reliability of analysis without altering valid information.

In [30]:
# Example: Correct Institution Name variations.
# Keys are in title case because the column was title-cased in the previous step.
institution_mapping = {
    'Saint Louis': 'Saint Louis University',
    'St Louis University': 'Saint Louis University',
    'St. Louis University': 'Saint Louis University'
}
df['Institution Name'] = df['Institution Name'].replace(institution_mapping)

# Correct Gender inconsistencies
# (title-casing turns "Don't" into "Don'T", so the key is spelled that way)
df['Gender'] = df['Gender'].replace({
    'M': 'Male',
    'F': 'Female',
    "Don'T Want To Specify": "Prefer Not To Say"
})
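
The cell above covers institutions and gender; invalid majors such as xxxhhyy are mentioned in the text but not handled in code. One possible approach is a simple heuristic that flags very short entries and entries without any vowel. This is only a sketch: the function `looks_invalid` and its rules are assumptions, and the heuristic would also flag legitimate abbreviations, so flagged values should be reviewed before they are replaced.

```python
import re
import pandas as pd

# Made-up sample of majors, including two junk entries.
majors = pd.Series(["Computer Science", "xxxhhyy", "Radiology", "Qq", "Information Systems"])

def looks_invalid(value):
    """Heuristic: too short, or contains no vowel at all."""
    text = value.strip().lower()
    if len(text) < 3:
        return True
    return re.search(r"[aeiou]", text) is None

# Keep valid entries, replace flagged ones with 'Unknown'.
cleaned_majors = majors.where(~majors.map(looks_invalid), "Unknown")
print(cleaned_majors.tolist())
```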

Expanding abbreviated Institution Name values into full names, grouped by country

In [33]:
# Dictionary for Institution Name corrections
institution_corrections = {
    # Afghanistan
    "Gndu": "Guru Nanak Dev University",
    
    # Azerbaijan
    "Ashoka": "Ashoka University",
    
    # Bangladesh
    "Aust": "Ahsanullah University of Science and Technology",
    "Cpscr": "Chattogram Port School and College",

    # British Indian Ocean Territory
    "Asdads": "Unknown Institution",

    # China
    "长沙学院": "Changsha University",

    # Egypt
    "Bue": "The British University in Egypt",
    "Must": "Misr University for Science and Technology",
    "Habiba": "Habiba Community School",

    # Ghana
    "Upsa": "University of Professional Studies, Accra",
    "Knust": "Kwame Nkrumah University of Science and Technology",

    # India
    # Keys are written in title case because the column was passed through
    # .str.title() in the standardization step (e.g. "IIT Delhi" is stored as "Iit Delhi").
    "Jntuh": "Jawaharlal Nehru Technological University Hyderabad",
    "Jntuhceh": "Jawaharlal Nehru Technological University Hyderabad",
    "Jntu Kakinada": "Jawaharlal Nehru Technological University Kakinada",
    "Jntuacek": "Jawaharlal Nehru Technological University Anantapur",
    "Jntuk": "Jawaharlal Nehru Technological University Kakinada",
    "Gitam": "Gandhi Institute of Technology and Management",
    "Nift": "National Institute of Fashion Technology",
    "Ignou": "Indira Gandhi National Open University",
    "Lpu": "Lovely Professional University",
    "Vtu": "Visvesvaraya Technological University",
    "Nituk": "National Institute of Technology Uttarakhand",
    "Nit Durgapur": "National Institute of Technology Durgapur",
    "Nit-Agartala": "National Institute of Technology Agartala",
    "Nit": "National Institute of Technology",
    "Vnit": "Visvesvaraya National Institute of Technology",
    "Psit": "Pranveer Singh Institute of Technology",
    "Srm": "SRM Institute of Science and Technology",
    "Srm University": "SRM Institute of Science and Technology",
    "Srm Ist": "SRM Institute of Science and Technology",
    "Srmlist": "SRM Institute of Science and Technology",
    "Vit University": "Vellore Institute of Technology",
    "Vit Ap University": "Vellore Institute of Technology - Andhra Pradesh",
    "Vit Chennai": "Vellore Institute of Technology - Chennai",
    "Vit Icer": "Vellore Institute of Technology - ICER",
    "Bits Pilani Hyderabad Campus": "Birla Institute of Technology and Science, Pilani - Hyderabad Campus",
    "Iit Delhi": "Indian Institute of Technology Delhi",
    "Iit Ism Dhanbad": "Indian Institute of Technology (ISM) Dhanbad",
    "Iiit Rk Valley": "Indian Institute of Information Technology RK Valley",
    "Iim Kozhikode": "Indian Institute of Management Kozhikode",
    "Iim Nagpur": "Indian Institute of Management Nagpur",
    "Msu": "Maharaja Sayajirao University of Baroda",
    "M. G University": "Mahatma Gandhi University",
    "Mjcet": "Muffakham Jah College of Engineering and Technology",
    "Gndit": "Guru Nanak Dev Institute of Technology",
    "Sju": "St. Joseph’s University",
    "Sies": "SIES College of Arts, Science & Commerce",
    "Sscbs": "Shaheed Sukhdev College of Business Studies",
    "Srcc": "Shri Ram College of Commerce",
    "Ramjas": "Ramjas College, University of Delhi",
    "Csjmu": "Chhatrapati Shahu Ji Maharaj University",
    "Rtmnu": "Rashtrasant Tukadoji Maharaj Nagpur University",
    "Hnbgu": "Hemvati Nandan Bahuguna Garhwal University",
    "Excelerate": "Excelerate Institute",
    "Iter": "Institute of Technical Education and Research",
    "Rvr&Jc": "R.V.R & J.C. College of Engineering",
    "Vnrvjiet": "VNR Vignana Jyothi Institute of Engineering & Technology",
    "Dps Ruby Park": "Delhi Public School Ruby Park",
    "Oakridge": "Oakridge International School",
    "Vibgyor": "Vibgyor Group of Schools",
    "Vibgyor High": "Vibgyor Group of Schools",
    "Jiet": "Jodhpur Institute of Engineering and Technology",
    "Lnit Srinagar": "National Institute of Technology Srinagar",
    "Tjit": "Thakur College of Engineering and Technology",
    "Cmr": "CMR Institute of Technology",
    "Cmr Technical Campus": "CMR Technical Campus",
    "Miritm": "Maturi Institute of Technology and Management",
    "Maac": "Maya Academy of Advanced Cinematics",
    "Ki University": "Kalinga Institute of Industrial Technology University",
    "K L University": "KL University",
    "Mit-Wpu": "MIT World Peace University",
    "Mit Wpu": "MIT World Peace University",
    "Mit - Wpu": "MIT World Peace University",
    "Mgit": "Mahatma Gandhi Institute of Technology",
    "Svit Vasad": "Sardar Vallabhbhai Patel Institute of Technology, Vasad",
    "Ljims": "L J Institute of Management Studies",
    "Fiem": "Future Institute of Engineering and Management",
    "Icfai University": "ICFAI University",
    "Vpkbiet": "Vidya Pratishthan’s Kamalnayan Bajaj Institute of Engineering and Technology",
    "Isb&M Pune": "International School of Business & Media, Pune",
    "Ips College": "IPS College of Technology & Management",
    "Ips Dehradun": "Institute of Professional Studies, Dehradun",
    "R V C E": "Rashtreeya Vidyalaya College of Engineering",
    "Svsps": "Sri Venkateswara Swamy Polytechnic",
    "Svnit": "Sardar Vallabhbhai National Institute of Technology",
    "Aec": "Aditya Engineering College",
    "Ihrd": "Institute of Human Resources Development",
    "Itm Sls University": "ITM SLS Baroda University",
    "Kr Mangalam": "K.R. Mangalam University",
    "Vignan": "Vignan’s Foundation for Science, Technology & Research",
    "Dseu": "Delhi Skill and Entrepreneurship University",
    "Pccoer": "Pimpri Chinchwad College of Engineering and Research",
    "P. V. G. Nashik": "Pune Vidyarthi Griha’s College of Engineering and Technology, Nashik",
    "P.V.G Nashik": "Pune Vidyarthi Griha’s College of Engineering and Technology, Nashik",
    "Biet": "Bapuji Institute of Engineering and Technology",
    "Sdes": "School of Design and Engineering Studies",
    "Ssit": "Sri Siddhartha Institute of Technology",
    "Mvj College of Engineering": "MVJ College of Engineering",
    "Gec Rajkot": "Government Engineering College Rajkot",
    "Dvr &D.Hs Mic College Of Technology": "DVR & DHS MIC College of Technology",
    "Upgrad": "UpGrad Learning Institute",
    "Bennet": "Bennett University",
    "Helpline": "Invalid University",
    "Afsm": "Invalid University",
    "Outr": "Invalid University",
    "Cdwcnwj": "Invalid University",
    "AsdfVvit": "Invalid University",
    "Asdf": "Invalid University",
    "Abc": "Invalid University",
    "Nil": "Invalid University",
    "Test": "Invalid University",
    "Ind": "Invalid University",
    "Sn": "Invalid University",
    "I": "Invalid University",
    "M.Tech": "Invalid University",
    "Popcorntime.Telugu@Gmail.Com": "Invalid University",

    # Iran
    "Gds": "Graduate School of Decision Sciences",

    # Kenya
    "Jkuat": "Jomo Kenyatta University of Agriculture and Technology",

    # Lebanon
    "Lau": "Lebanese American University",

    # Malaysia
    "Utem": "Universiti Teknikal Malaysia Melaka",

    # Nigeria
    "Lasu": "Lagos State University",
    "Oauthc": "Obafemi Awolowo University Teaching Hospitals Complex",
    "Niit": "National Institute of Information Technology",
    "Fuoye": "Federal University Oye-Ekiti",
    "Uniben": "University of Benin",

    # Pakistan
    "Nwihs": "Northwest Institute of Health Sciences",
    "Nust": "National University of Sciences and Technology",
    "Fuuast": "Federal Urdu University of Arts, Science and Technology",
    "Ned": "NED University of Engineering and Technology",
    "Muet": "Mehran University of Engineering and Technology",
    "Iub": "Islamia University of Bahawalpur",
    "Iobm": "Institute of Business Management",
    "Uit": "Usman Institute of Technology",
    "Lcwu": "Lahore College for Women University",
    "Gcuf": "Government College University Faisalabad",
    "Pmas": "Pir Mehr Ali Shah Arid Agriculture University",

    # Philippines
    "Joshua": "Joshua Christian Academy",

    # Rwanda
    "Unilak": "University of Lay Adventists of Kigali",
    "Nega": "New Generation Academy",

    # South Africa
    "Umuzi": "Umuzi Academy",

    # UAE
    "Excelr": "ExcelR Training Institute",

    # United States
    "Pgcc": "Prince George's Community College",
    "Slu": "Saint Louis University",
    "Jntu": "Jawaharlal Nehru Technological University",
    "Qq": "Unknown Institution"
}

# Replace short forms with full names
df['Institution Name'] = df['Institution Name'].replace(institution_corrections)

# Verify corrections
print(df['Institution Name'].unique())


['Northwest Institute of Health Sciences' 'Saint Louis University'
 'Illinois Institute Of Technology' ... 'Metea Valley High School'
 'Dhanalakshmi Srinivasan Engineering College Perambalur'
 'Jawaharlal Nehru Technological University Of Hyderabad']


During the data cleaning process, several errors were found in the Institution Name column. Many institutions, particularly from India, were recorded in short forms or abbreviations (e.g., IIT Delhi, NIT Durgapur, JNTU), which were expanded into their respective full names to maintain consistency and accuracy. Additionally, a number of entries contained invalid or non-academic values (such as random text, test inputs, or email addresses). These invalid institution names were replaced with “Unknown Institution” or “Invalid University” to avoid ambiguity and preserve data quality.
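
With a correction dictionary of this size, it helps to check which keys actually occur in the column, since a key whose spelling or casing differs from the stored value silently does nothing. The sketch below lists such unused keys. The short `corrections` dictionary and the sample values are a reduced illustration, not the full mapping.

```python
import pandas as pd

# Values as they would look after title-casing (made-up sample).
institutions = pd.Series(["Iit Delhi", "Nwihs", "Slu", "Saint Louis University"])

# Reduced example mapping; "IIT Delhi" cannot match the title-cased data.
corrections = {
    "IIT Delhi": "Indian Institute of Technology Delhi",
    "Nwihs": "Northwest Institute of Health Sciences",
    "Slu": "Saint Louis University",
}

# Keys that never appear in the column and therefore have no effect.
present = set(institutions.unique())
unused_keys = sorted(k for k in corrections if k not in present)
print(unused_keys)
```

Running the same check against the full dictionary before calling `replace` would show which entries need their keys adjusted.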

5. Dealing With Duplicates

Identifying Duplicates

In [34]:
# Check for duplicates across all columns
duplicate_rows = df[df.duplicated()]
print(f"Number of duplicate rows: {len(duplicate_rows)}")

# Check for duplicates based on key identifiers only
duplicate_keys = df[df.duplicated(subset=['Opportunity Id', 'Learner SignUp DateTime'])]
print(f"Number of duplicate key records: {len(duplicate_keys)}")

Number of duplicate rows: 0
Number of duplicate key records: 312


The dataset has no fully duplicated rows, but 312 records repeat an existing combination of Opportunity Id and Learner SignUp DateTime. Since the dataset has no explicit learner identifier, the sign-up timestamp serves here as a proxy for the learner, so these records are treated as the same learner recorded more than once for the same opportunity. To maintain data integrity and avoid redundancy, these duplicate key records are removed while keeping the first occurrence of each combination. This keeps counts accurate, prevents skewed analysis, and preserves the reliability of the dataset for further processing.

Removing Duplicates

In [35]:
# Remove duplicates based on key identifiers
df = df.drop_duplicates(subset=['Opportunity Id', 'Learner SignUp DateTime'], keep='first')

# Verify
print(f"Remaining records after removing duplicates: {len(df)}")


Remaining records after removing duplicates: 8246
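
Before dropping rows on a two-column key, it can be worth looking at the groups that share a key, because rows with the same Opportunity Id and sign-up timestamp may still differ in other columns. The sketch below uses `keep=False` to show every member of each duplicate group. The toy rows are made up, and using First Name as the comparison column is only an example.

```python
import pandas as pd

# Toy rows: the first two share the key but have different names.
sample = pd.DataFrame({
    "Opportunity Id": ["A1", "A1", "A1", "B2"],
    "Learner SignUp DateTime": pd.to_datetime([
        "2023-06-14 12:30:35", "2023-06-14 12:30:35",
        "2023-05-01 05:29:16", "2023-06-14 12:30:35",
    ]),
    "First Name": ["Faria", "Poojitha", "Emmanuel", "Faria"],
})

key = ["Opportunity Id", "Learner SignUp DateTime"]

# keep=False marks every row of a duplicate group, not only the repeats.
groups = sample[sample.duplicated(subset=key, keep=False)].sort_values(key)

# Distinct first names per key; a value above 1 suggests different learners.
conflicting = groups.groupby(key)["First Name"].nunique()
print(groups)
print(conflicting)
```

If many groups contain more than one distinct name, the key is too coarse and additional columns may need to be added to `subset`.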


6. Handling Inconsistent Categorical Data

Identifying Inconsistencies

In [None]:
# Full list of categorical/text columns
cat_cols = ['Opportunity Name', 'Opportunity Category', 'First Name', 
            'Gender', 'Country', 'Institution Name', 'Current/Intended Major', 
            'Status Description']

# Inspect unique values for inconsistencies
for col in cat_cols:
    print(f"{col} unique values:")
    print(df[col].unique())
    print("-"*50)

Handling Inconsistencies

In [37]:
# List of categorical columns
cat_cols = ['Opportunity Name', 'Opportunity Category', 'First Name', 
            'Gender', 'Country', 'Institution Name', 'Current/Intended Major', 
            'Status Description']

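# Step 1: Strip spaces and convert all text to title case.
# Note: title-casing also re-cases values corrected earlier
# (e.g. "SRM Institute of Science and Technology" becomes
# "Srm Institute Of Science And Technology").
for col in cat_cols:
    df[col] = df[col].astype(str).str.strip().str.title()

# Step 2: Standardize Gender
# (the first key reflects how str.title() renders "Don't")
df['Gender'] = df['Gender'].replace({
    "Don'T Want To Specify": "Prefer Not To Say",
    'M': 'Male',
    'F': 'Female'
})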

# Step 3: Standardize Institution Names (add other known variations if needed)
institution_mapping = {
    'Saint Louis': 'Saint Louis University',
    'St Louis University': 'Saint Louis University',
    'St. Louis University': 'Saint Louis University',
    'St Louis': 'Saint Louis University',
    'St. Louis': 'Saint Louis University'
}
df['Institution Name'] = df['Institution Name'].replace(institution_mapping)

# Step 4: Standardize Country names (examples, expand as needed)
country_mapping = {
    'Tanzania, United Republic Of Tanzania': 'Tanzania',
    'Iran, Islamic Republic Of Persian Gulf': 'Iran',
    'Iran  Islamic Republic Of Persian Gulf': 'Iran',
    'Korea, Republic Of South Korea': 'South Korea'
}
df['Country'] = df['Country'].replace(country_mapping)

# Step 5: Verify unique values after standardization
for col in cat_cols:
    print(f"{col} unique values after standardization:")
    print(df[col].unique())
    print("-"*50)


Opportunity Name unique values after standardization:
['Career Essentials: Getting Started With Your Professional Journey'
 'Slide Geeks: A Presentation Design Competition' 'Digital Marketing'
 'Health Care Management' 'Innovation & Entrepreneurship'
 'Project Management' 'Data Visualization' 'Cpr/Aed Certification'
 'Mental And Physical Health Session'
 'Jump Start: Developing Your Emotional Intelligence'
 'Join A Student Organisation' 'Upload Your First Year Transcript'
 'Startup Mastery Workshop' 'Ai Ethics Challenge'
 'Data Visualization Associate' 'Digital Strategy Virtual Internship'
 'Project Management Associate' 'Business Consulting'
 'Urbanrenew Challenge' 'Ux Redesign Challenge'
 'Xperience Design Hackathon' 'Freelance Mastery Workshop']
--------------------------------------------------
Opportunity Category unique values after standardization:
['Course' 'Competition' 'Internship' 'Event' 'Engagement']
--------------------------------------------------
First Name unique valu

Several categorical columns contained inconsistent entries caused by variations in spelling, capitalization, or formatting. For example, Gender included values such as "Don'T Want To Specify", Country had verbose representations such as "Tanzania, United Republic Of Tanzania", and Institution Name had multiple versions of the same university, such as "St. Louis University" and "ST LOUIS UNIVERSITY". To address this, we stripped extra spaces, converted all text to title case, and mapped known variations to a single consistent value. After this process, the categorical columns (Gender, Country, Institution Name, Opportunity Name, Current/Intended Major, and the others in cat_cols) follow one format, which supports accurate grouping, filtering, and analysis. Title-casing also lowercases the tail of acronyms, as seen in "Cpr/Aed Certification" and "Ai Ethics Challenge" in the output above, so such values may need a follow-up mapping if the original casing matters.
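
The strip, title-case, and map sequence described above can be wrapped in one small helper. This is a sketch under assumptions and not the notebook's exact code: the helper name `standardize_text` and the sample rows are hypothetical, while the two mapping entries mirror the ones used in the notebook. It applies the `.str` methods directly, so missing values stay missing.

```python
from typing import Optional

import numpy as np
import pandas as pd

# Hypothetical sample rows; the mapping entries mirror the notebook's.
sample = pd.DataFrame({
    "Gender": [" male", "FEMALE", "don't want to specify", np.nan],
    "Country": ["india ", "Tanzania, United Republic of Tanzania", "INDIA", "Pakistan"],
})

def standardize_text(series: pd.Series, mapping: Optional[dict] = None) -> pd.Series:
    """Trim, title-case, then apply an optional mapping; NaN stays NaN."""
    cleaned = series.str.strip().str.title()
    return cleaned.replace(mapping) if mapping else cleaned

# Keys are written in title case because the mapping runs after .str.title().
gender_map = {"Don'T Want To Specify": "Prefer Not To Say"}
country_map = {"Tanzania, United Republic Of Tanzania": "Tanzania"}

sample["Gender"] = standardize_text(sample["Gender"], gender_map)
sample["Country"] = standardize_text(sample["Country"], country_map)
print(sample)
```

Because the mapping is applied after title-casing, its keys must match the title-cased form, which is why "Don'T" appears with a capital T.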

Exporting the Cleaned Dataset After the Data Cleaning Process

In [49]:
# Export the cleaned dataset to CSV
df.to_csv("Cleaned_Preprocessed_Dataset_Week1.csv", index=False)

# The dataset is now cleaned.

Total Rows and Columns in the Cleaned Dataset

In [50]:
df_cleaned = pd.read_csv("Cleaned_Preprocessed_Dataset_Week1.csv")

# Total rows and columns
total_rows, total_cols = df_cleaned.shape
print(f"Total Rows: {total_rows}")
print(f"Total Columns: {total_cols}")

Total Rows: 8246
Total Columns: 16
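
A shape check confirms that no rows or columns were lost in the export, but CSV files do not store dtypes, so date columns come back as plain text unless they are parsed on load. The sketch below illustrates a round trip on a hypothetical two-row frame, using an in-memory buffer in place of the file. The frame contents and the names `frame`, `buffer`, and `reloaded` are assumptions for illustration.

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for the cleaned dataset.
frame = pd.DataFrame({
    "Opportunity Id": ["A1", "B2"],
    "Apply Date": pd.to_datetime(["2023-06-14 12:36:09", "2023-05-01 06:08:21"]),
    "Status Code": [1080, 1070],
})

buffer = io.StringIO()  # in-memory stand-in for the CSV file
frame.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype that the CSV format does not keep.
reloaded = pd.read_csv(buffer, parse_dates=["Apply Date"])

print(f"Shape preserved: {reloaded.shape == frame.shape}")
print(reloaded.dtypes)
```

For the real file, the same `parse_dates` argument could list whichever date columns need to be datetimes in later analysis.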


Viewing the First Few Rows of the Cleaned Dataset

In [51]:
from IPython.display import display
display(df.head(5))

Unnamed: 0,Learner SignUp DateTime,Opportunity Id,Opportunity Name,Opportunity Category,Opportunity End Date,First Name,Date of Birth,Gender,Country,Institution Name,Current/Intended Major,Entry created at,Status Description,Status Code,Apply Date,Opportunity Start Date
0,2023-06-14 12:30:35,00000000-0GN2-A0AY-7XK8-C5FZPP,Career Essentials: Getting Started With Your P...,Course,2024-06-29 18:52:39,Faria,01/12/2001,Female,Pakistan,Northwest Institute Of Health Sciences,Radiology,2024-03-11 12:01:41,Started,1080,2023-06-14 12:36:09.000000,2022-11-03 18:30:39
1,2023-05-01 05:29:16,00000000-0GN2-A0AY-7XK8-C5FZPP,Career Essentials: Getting Started With Your P...,Course,2024-06-29 18:52:39,Poojitha,08/16/2000,Female,India,Saint Louis University,Information Systems,2024-03-11 12:01:41,Started,1080,2023-05-01 06:08:21.000000,2022-11-03 18:30:39
2,2023-04-09 20:35:08,00000000-0GN2-A0AY-7XK8-C5FZPP,Career Essentials: Getting Started With Your P...,Course,2024-06-29 18:52:39,Emmanuel,01/27/2002,Male,United States,Illinois Institute Of Technology,Computer Science,2024-03-11 12:01:41,Started,1080,2025-09-20 14:08:56.929364,2022-11-03 18:30:39
3,2023-08-29 05:20:03,00000000-0GN2-A0AY-7XK8-C5FZPP,Career Essentials: Getting Started With Your P...,Course,2024-06-29 18:52:39,Amrutha Varshini,11/01/1999,Female,United States,Saint Louis University,Information Systems,2024-03-11 12:01:41,Team Allocated,1070,2023-10-09 22:02:42.000000,2022-11-03 18:30:39
4,2023-01-06 15:26:36,00000000-0GN2-A0AY-7XK8-C5FZPP,Career Essentials: Getting Started With Your P...,Course,2024-06-29 18:52:39,Vinay Varshith,04/19/2000,Male,United States,Saint Louis University,Computer Science,2024-03-11 12:01:41,Started,1080,2023-01-06 15:40:10.000000,2022-11-03 18:30:39
