Assigned Dataset: "SLU_Opportunity_Wise_Data" (csv file) [Raw Dataset]


After data cleaning, the dataset was exported as "Cleaned_Preprocessed_Dataset" (csv file)


After feature engineering on the cleaned dataset, it was renamed "Featured Dataset" (csv file)

# Verification of "Cleaned_Preprocessed_Dataset.csv"

Importing Necessary Libraries

In [1]:
import pandas as pd
import numpy as np
import re
import missingno as msno
from sklearn.impute import SimpleImputer

Reading the CSV file

In [3]:
file_path = "Cleaned_Preprocessed_Dataset.csv"  
df = pd.read_csv(file_path)

Verifying Missing Values in each column

In [4]:
# Quick overview in one DataFrame
missing_report = pd.DataFrame({
    "Missing Values": df.isnull().sum(),
    "Percentage (%)": (df.isnull().sum() / len(df)) * 100
})
print(missing_report)

                         Missing Values  Percentage (%)
Learner SignUp DateTime              19        0.230415
Opportunity Id                        0        0.000000
Opportunity Name                      0        0.000000
Opportunity Category                  0        0.000000
Opportunity End Date               1216       14.746544
First Name                            0        0.000000
Date of Birth                         0        0.000000
Gender                                0        0.000000
Country                               0        0.000000
Institution Name                      5        0.060635
Current/Intended Major                5        0.060635
Entry created at                      0        0.000000
Status Description                    0        0.000000
Status Code                           0        0.000000
Apply Date                            0        0.000000
Opportunity Start Date              811        9.835072


Handling Missing Values

In [6]:
# Convert Learner SignUp DateTime to datetime
df['Learner SignUp DateTime'] = pd.to_datetime(df['Learner SignUp DateTime'])

# Fill missing Learner SignUp DateTime with the most frequent date
most_frequent_signup = df['Learner SignUp DateTime'].mode()[0]
df['Learner SignUp DateTime'] = df['Learner SignUp DateTime'].fillna(most_frequent_signup)

# Fill Institution Name and Major
df['Institution Name'] = df['Institution Name'].fillna('Not Applicable')
df['Current/Intended Major'] = df['Current/Intended Major'].fillna('Other')

# Convert Opportunity dates to datetime
df['Opportunity Start Date'] = pd.to_datetime(df['Opportunity Start Date'])
df['Opportunity End Date'] = pd.to_datetime(df['Opportunity End Date'])
df['Apply Date'] = pd.to_datetime(df['Apply Date'])
df['Entry created at'] = pd.to_datetime(df['Entry created at'])

# Fill missing Opportunity Start Date and End Date with the most frequent date
most_frequent_start = df['Opportunity Start Date'].mode()[0]
most_frequent_end = df['Opportunity End Date'].mode()[0]

df['Opportunity Start Date'] = df['Opportunity Start Date'].fillna(most_frequent_start)
df['Opportunity End Date'] = df['Opportunity End Date'].fillna(most_frequent_end)

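# Format all dates as M/D/YYYY HH:MM:SS AM/PM strings.
# Note: '%-m' and '%-d' (no zero padding) are platform-specific; on Windows use '%#m' and '%#d'.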
for col in ['Learner SignUp DateTime', 'Opportunity Start Date', 'Opportunity End Date', 'Apply Date', 'Entry created at']:
    df[col] = pd.to_datetime(df[col])
    df[col] = df[col].dt.strftime('%-m/%-d/%Y %I:%M:%S %p')

Checking whether any missing values remain in those columns

In [7]:
# List of columns you want to check
columns_to_check = ['Learner SignUp DateTime', 'Opportunity End Date', 
                    'Institution Name', 'Current/Intended Major', 'Opportunity Start Date']

# Check missing values in these columns
missing_summary = df[columns_to_check].isnull().sum()
missing_percentage = (df[columns_to_check].isnull().sum() / len(df)) * 100

# Combine into a single DataFrame for easy viewing
missing_df = pd.DataFrame({'Missing Values': missing_summary, 'Percentage (%)': missing_percentage})
print(missing_df)


                         Missing Values  Percentage (%)
Learner SignUp DateTime               0             0.0
Opportunity End Date                  0             0.0
Institution Name                      0             0.0
Current/Intended Major                0             0.0
Opportunity Start Date                0             0.0


Note: Missing values were filled so that the dataset is complete and consistent. Missing Learner SignUp DateTime entries were filled with the most frequent sign-up date in the dataset. Institution Name and Current/Intended Major, which had only five missing values each, were filled with 'Not Applicable' and 'Other' respectively. Missing Opportunity Start Date and Opportunity End Date values were likewise filled with the most frequent value of each column, as in the code above. All date columns were then written back as strings in M/D/YYYY HH:MM:SS AM/PM format, and the check above confirms that no missing values remain in these five columns.
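
The fills above rely on `mode()[0]`, which raises an error when a column has no non-missing values at all. A minimal sketch of a guarded version follows; the helper name `fill_with_mode` and the toy dates are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

def fill_with_mode(series):
    """Fill missing entries with the most frequent non-missing value.

    Hypothetical helper: returns the series unchanged when every
    entry is missing, instead of failing on mode()[0].
    """
    modes = series.mode(dropna=True)
    if modes.empty:
        return series
    return series.fillna(modes.iloc[0])

# Toy data, not taken from the dataset
toy_dates = pd.Series(pd.to_datetime(
    ["2023-01-01", "2023-01-01", None, "2023-02-01"]
))
filled_dates = fill_with_mode(toy_dates)
print(filled_dates.isnull().sum())
```

When several values tie for most frequent, `mode()` returns them sorted, so `iloc[0]` picks the smallest one; that may or may not be the desired tie-break.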

Verifying Country column's Values - Name Correction

In [8]:
# Unique country names
print(df['Country'].unique())

# Number of unique countries
print("Total unique countries:", df['Country'].nunique())

# Frequency count of each country
print(df['Country'].value_counts())

['Pakistan' 'India' 'United States' 'United Arab Emirates' 'Nigeria'
 'Egypt' 'Nepal' 'Kenya' 'Ghana' 'Zambia' 'Morocco' 'Ethiopia' 'Zimbabwe'
 'Uganda' 'Indonesia' 'Cameroon' 'Yemen' 'China' 'Bangladesh' 'Congo'
 'Liberia' 'United Kingdom' 'Vietnam' 'Japan' 'Rwanda' 'Gambia'
 'Philippines' 'Australia' 'Somalia' 'Sierra Leone' 'Lebanon' 'Botswana'
 'Iraq' 'Uzbekistan' 'Turkey' 'Honduras'
 'Tanzania, United Republic Of Tanzania' 'British Indian Ocean Territory'
 'France' 'Belarus' 'Algeria' 'Korea, Republic Of South Korea' 'Mauritius'
 'Tunisia' 'Kazakhstan' 'Peru' 'Brazil' 'Ukraine' 'South Africa' 'Namibia'
 'Iran, Islamic Republic Of Persian Gulf' 'American Samoa'
 'Falkland Islands (Malvinas)' 'Saudi Arabia' 'Azerbaijan'
 'Dominican Republic' 'Lesotho' 'Malaysia' 'Virgin Islands, U.S.' 'Qatar'
 'Germany' 'Canada' 'Singapore' 'Iran  Islamic Republic Of Persian Gulf'
 'Ireland' 'Libyan Arab Jamahiriya' "Cote D'Ivoire" 'Afghanistan' 'Bhutan'
 'Spain']
Total unique countries: 70
Country


In [10]:
# Dictionary to fix inconsistent country names
country_mapping = {
    "Tanzania, United Republic Of Tanzania": "Tanzania",
    "Korea, Republic Of South Korea": "South Korea",
    "Iran, Islamic Republic Of Persian Gulf": "Iran",
    "Iran  Islamic Republic Of Persian Gulf": "Iran",
    "Libyan Arab Jamahiriya": "Libya",
    "Cote D'Ivoire": "Ivory Coast",
    "British Indian Ocean Territory": "British Indian Ocean Terr.",
    "Virgin Islands, U.S.": "U.S. Virgin Islands",
    "Falkland Islands (Malvinas)": "Falkland Islands"
}

# Apply mapping
df['Country'] = df['Country'].replace(country_mapping)

# Get cleaned unique country list
cleaned_countries = sorted(df['Country'].unique())
print(cleaned_countries)
print("Total unique countries after cleaning:", df['Country'].nunique())


['Afghanistan', 'Algeria', 'American Samoa', 'Australia', 'Azerbaijan', 'Bangladesh', 'Belarus', 'Bhutan', 'Botswana', 'Brazil', 'British Indian Ocean Terr.', 'Cameroon', 'Canada', 'China', 'Congo', 'Dominican Republic', 'Egypt', 'Ethiopia', 'Falkland Islands', 'France', 'Gambia', 'Germany', 'Ghana', 'Honduras', 'India', 'Indonesia', 'Iran', 'Iraq', 'Ireland', 'Ivory Coast', 'Japan', 'Kazakhstan', 'Kenya', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Malaysia', 'Mauritius', 'Morocco', 'Namibia', 'Nepal', 'Nigeria', 'Pakistan', 'Peru', 'Philippines', 'Qatar', 'Rwanda', 'Saudi Arabia', 'Sierra Leone', 'Singapore', 'Somalia', 'South Africa', 'South Korea', 'Spain', 'Tanzania', 'Tunisia', 'Turkey', 'U.S. Virgin Islands', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'Uzbekistan', 'Vietnam', 'Yemen', 'Zambia', 'Zimbabwe']
Total unique countries after cleaning: 69


In [11]:
# Total unique countries
print("Total unique countries after cleaning:", df['Country'].nunique())

# Count of rows per country
country_counts = df['Country'].value_counts()
print("Country-wise count:")
print(country_counts)

Total unique countries after cleaning: 69
Country-wise count:
Country
United States    3774
India            2785
Nigeria           736
Ghana             256
Pakistan          218
                 ... 
Ireland             1
Libya               1
Afghanistan         1
Bhutan              1
Spain               1
Name: count, Length: 69, dtype: int64


In [12]:
# Ensure all rows are shown
pd.set_option('display.max_rows', None)

# Country-wise count
country_counts = df['Country'].value_counts()
print("Country-wise count:")
print(country_counts)

Country-wise count:
Country
United States                 3774
India                         2785
Nigeria                        736
Ghana                          256
Pakistan                       218
Bangladesh                      59
Egypt                           49
Ethiopia                        38
Kenya                           37
Rwanda                          28
Nepal                           20
Turkey                          17
United Kingdom                  17
Philippines                     13
Uganda                          12
Vietnam                         11
Sierra Leone                    11
United Arab Emirates            10
South Korea                      9
Iran                             9
Zimbabwe                         8
Gambia                           8
Zambia                           8
China                            8
Tanzania                         7
South Africa                     7
Indonesia                        6
Morocco                    

Note: The Country column initially held 70 unique entries, some of them inconsistent because of varying name formats, duplicate representations, and outdated terms. For example, “Iran, Islamic Republic Of Persian Gulf” and “Iran  Islamic Republic Of Persian Gulf” were both standardized to “Iran”, “Korea, Republic Of South Korea” was simplified to “South Korea”, “Tanzania, United Republic Of Tanzania” was corrected to “Tanzania”, and “Libyan Arab Jamahiriya” was updated to “Libya”. After the mapping was applied, 69 unique countries remain; the count drops by one because the two Iran variants were the only entries that merged, while the other changes were plain renames. This keeps country-based analysis uniform and avoids splitting one country across several labels.
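
A small sketch of how `replace` with a mapping affects the unique count: a plain rename leaves the count unchanged, while merging two variants of the same country lowers it by one. The mapping is a subset of the one used above; the toy series is an assumption for illustration.

```python
import pandas as pd

# Subset of the mapping used in the notebook
country_mapping = {
    "Iran, Islamic Republic Of Persian Gulf": "Iran",
    "Iran  Islamic Republic Of Persian Gulf": "Iran",
    "Libyan Arab Jamahiriya": "Libya",
}

# Toy column, not taken from the dataset
toy_countries = pd.Series([
    "Iran, Islamic Republic Of Persian Gulf",
    "Iran  Islamic Republic Of Persian Gulf",
    "Libyan Arab Jamahiriya",
    "India",
])

cleaned = toy_countries.replace(country_mapping)
before, after = toy_countries.nunique(), cleaned.nunique()
print(before, after)
```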

Verifying Institution column's Values - Name Correction

In [13]:
# Unique institution names
print(df['Institution Name'].unique())

# Number of unique institutions
print("Total unique institutions:", df['Institution Name'].nunique())

# Frequency count of each institution
print(df['Institution Name'].value_counts())

['Northwest Institute Of Health Sciences' 'Saint Louis University'
 'Illinois Institute Of Technology' ... 'Metea Valley High School'
 'Dhanalakshmi Srinivasan Engineering College Perambalur'
 'Jawaharlal Nehru Technological University Of Hyderabad']
Total unique institutions: 1677
Institution Name
Saint Louis University                                                                                        3487
Not Applicable                                                                                                 777
Illinois Institute Of Technology                                                                               151
Webster University                                                                                              61
Kwame Nkrumah University Of Science And Technology                                                              47
Srm University                                                                                                  31
University

Note: The Institution Name column contains 1677 unique entries, covering universities, colleges, and high schools, plus 777 rows marked Not Applicable. The most frequent institution is Saint Louis University (3487), followed by Illinois Institute Of Technology (151) and Webster University (61). Unlike the Country column, the institution names were judged distinct and consistent, so no standardization or correction was applied.
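
With 1677 unique names, consistency is hard to confirm by eye. One possible check, sketched here under the assumption that variants differ only in case or spacing, groups the names by a normalized key and reports keys with more than one spelling. The helper `find_name_variants` and the toy names are hypothetical.

```python
import pandas as pd

def find_name_variants(series):
    """Return normalized names that appear under more than one spelling."""
    normalized = (
        series.str.strip()
              .str.replace(r"\s+", " ", regex=True)  # collapse repeated spaces
              .str.casefold()                        # ignore letter case
    )
    spellings_per_key = series.groupby(normalized).nunique()
    return spellings_per_key[spellings_per_key > 1]

# Toy column, not taken from the dataset
toy_institutions = pd.Series([
    "Saint Louis University",
    "saint louis  university",
    "Webster University",
])
variants = find_name_variants(toy_institutions)
print(variants)
```

An empty result on the real column would support the conclusion in the note; a non-empty one would list candidates for a mapping like the one used for countries.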

Verifying Gender column Values

In [14]:
# Unique gender values
print(df['Gender'].unique())

# Number of unique gender values
print("Total unique genders:", df['Gender'].nunique())

# Frequency count of each gender
print(df['Gender'].value_counts())


['Female' 'Male' 'Prefer Not To Say' 'Other']
Total unique genders: 4
Gender
Male                 4861
Female               3367
Prefer Not To Say      15
Other                   3
Name: count, dtype: int64


Note: The Gender column contains four distinct categories: Male, Female, Prefer Not To Say, and Other. The majority of entries are Male (4861) and Female (3367), while Prefer Not To Say (15) and Other (3) represent very few responses. The data is consistent and requires no further cleaning, allowing for reliable analysis of gender-based distribution in the dataset.

Verifying Opportunity Category column's values

In [15]:
# Unique values in Opportunity Category
print(df['Opportunity Category'].unique())

# Number of unique categories
print("Total unique categories:", df['Opportunity Category'].nunique())

# Frequency count of each category
print(df['Opportunity Category'].value_counts())

['Course' 'Competition' 'Internship' 'Event' 'Engagement']
Total unique categories: 5
Opportunity Category
Internship     5242
Course         1935
Event           526
Competition     415
Engagement      128
Name: count, dtype: int64


Note: The Opportunity Category column contains five unique categories: Internship, Course, Event, Competition, and Engagement. The majority of entries are Internship (5242), followed by Course (1935), Event (526), Competition (415), and Engagement (128). The data is consistent, with no variations or misspellings, enabling accurate analysis of opportunities by category.
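
The same consistency check can be automated for small categorical columns by comparing the observed values with an expected list. This is a minimal sketch; the helper `unexpected_values` and the toy series (with a deliberate lowercase variant) are assumptions, while the expected categories are the five printed above.

```python
import pandas as pd

def unexpected_values(series, allowed):
    """Return the distinct non-missing values that are not in `allowed`."""
    return sorted(set(series.dropna().unique()) - set(allowed))

expected_categories = ["Course", "Competition", "Internship", "Event", "Engagement"]

# Toy column with one misspelled entry, not taken from the dataset
toy_categories = pd.Series(["Course", "Internship", "internship", "Event"])
print(unexpected_values(toy_categories, expected_categories))
```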

Verifying Current/Intended Major column's values

In [16]:
# Unique major names
print(df['Current/Intended Major'].unique())

# Number of unique majors
print("Total unique majors:", df['Current/Intended Major'].nunique())

# Frequency count of each major
print(df['Current/Intended Major'].value_counts())


['Radiology' 'Information Systems' 'Computer Science'
 'Mechanical Engineering' 'Computer Science And Engineering'
 'Artificial Intelligence' 'Robotics And Automation Engineering'
 'Data Visualization' 'Business Administration' 'Public Health'
 'Architecture' 'Computer Science And Information Systems' 'Biology'
 'Economics' 'Other' 'Mathematics' 'Bioinformatics'
 'Biomedical Engineering' 'Electrical And Electronic Engineering'
 'Business And Management Studies' 'Electrical And Computer Engineering'
 'Accounting And Finance' 'Secretarial' 'Data Science' 'Statistics'
 'Electronics And Communication' 'Computer Information Systems'
 'Management Information Systems' 'Project Management' 'Medicine'
 'Information' 'Information Technology' 'Actuarial Mathematics'
 'Software Engineering' 'Biological Sciences'
 'Urban And Housing Development' 'Human Resources' 'Cyber Security'
 'Data Analytics' 'Computer Engineering' 'Environmental Sciences'
 'Philosophy' 'Law And Legal Studies' 'Industrial Engi

In [17]:
# Map only unclear/misc entries to "Other"
others_mapping = {
    "Other": "Other",
    "No": "Other",
    "Nil": "Other",
    "Already Graduation Completed": "Other",
    "To Study": "Other",
    "Ghj": "Other",
    "Sdada": "Other",
    "Othe": "Other",
    "Yoganand Sir": "Other",
    "Cycw": "Other",
    "Pos Service": "Other"
}

# Apply mapping
df['Current/Intended Major'] = df['Current/Intended Major'].replace(others_mapping)

# Check updated unique values
print(df['Current/Intended Major'].unique())
print("Total unique majors after cleaning:", df['Current/Intended Major'].nunique())


['Radiology' 'Information Systems' 'Computer Science'
 'Mechanical Engineering' 'Computer Science And Engineering'
 'Artificial Intelligence' 'Robotics And Automation Engineering'
 'Data Visualization' 'Business Administration' 'Public Health'
 'Architecture' 'Computer Science And Information Systems' 'Biology'
 'Economics' 'Other' 'Mathematics' 'Bioinformatics'
 'Biomedical Engineering' 'Electrical And Electronic Engineering'
 'Business And Management Studies' 'Electrical And Computer Engineering'
 'Accounting And Finance' 'Secretarial' 'Data Science' 'Statistics'
 'Electronics And Communication' 'Computer Information Systems'
 'Management Information Systems' 'Project Management' 'Medicine'
 'Information' 'Information Technology' 'Actuarial Mathematics'
 'Software Engineering' 'Biological Sciences'
 'Urban And Housing Development' 'Human Resources' 'Cyber Security'
 'Data Analytics' 'Computer Engineering' 'Environmental Sciences'
 'Philosophy' 'Law And Legal Studies' 'Industrial Engi

Note: The Current/Intended Major column initially contained 363 unique entries, including some unclear or irrelevant entries such as No, Nil, Already Graduation Completed, and other miscellaneous text. These entries have been grouped under “Other”, reducing the total unique majors to 353. All valid majors, such as Computer Science, Data Science, Mechanical Engineering, Business Administration, and Health-related fields, remain unchanged. This cleaning ensures that the column is consistent and suitable for accurate analysis based on learners’ current or intended fields of study.

Verifying Status Description column's values

In [18]:
# Unique status descriptions
print(df['Status Description'].unique())

# Number of unique statuses
print("Total unique status descriptions:", df['Status Description'].nunique())

# Frequency count of each status
print(df['Status Description'].value_counts())


['Started' 'Team Allocated' 'Waitlisted' 'Withdraw' 'Rewards Award'
 'Dropped Out' 'Rejected' 'Applied']
Total unique status descriptions: 8
Status Description
Rejected          3447
Team Allocated    3169
Started            724
Dropped Out        596
Applied            103
Waitlisted          96
Withdraw            82
Rewards Award       29
Name: count, dtype: int64


Note: The Status Description column contains eight unique values, with the most common being Rejected (3447) and Team Allocated (3169). Other statuses include Started, Dropped Out, Applied, Waitlisted, Withdraw, and Rewards Award. The data is consistent and requires no further cleaning, making it ready for accurate analysis of learner progress and engagement.

Verifying Opportunity Name column's values

In [19]:
# Unique opportunity names
print(df['Opportunity Name'].unique())

# Number of unique opportunities
print("Total unique opportunity names:", df['Opportunity Name'].nunique())

# Frequency count of each opportunity
print(df['Opportunity Name'].value_counts())


['Career Essentials: Getting Started With Your Professional Journey'
 'Slide Geeks: A Presentation Design Competition' 'Digital Marketing'
 'Health Care Management' 'Innovation & Entrepreneurship'
 'Project Management' 'Data Visualization' 'Cpr/Aed Certification'
 'Mental And Physical Health Session'
 'Jump Start: Developing Your Emotional Intelligence'
 'Join A Student Organisation' 'Upload Your First Year Transcript'
 'Startup Mastery Workshop' 'Ai Ethics Challenge'
 'Data Visualization Associate' 'Digital Strategy Virtual Internship'
 'Project Management Associate' 'Business Consulting'
 'Urbanrenew Challenge' 'Ux Redesign Challenge'
 'Xperience Design Hackathon' 'Freelance Mastery Workshop']
Total unique opportunity names: 22
Opportunity Name
Career Essentials: Getting Started With Your Professional Journey    1347
Data Visualization                                                    954
Project Management                                                    805
Health Care Managemen

Note: The Opportunity Name column contains 22 unique opportunities, ranging from workshops and certifications to competitions and internships. The most common opportunities are Career Essentials: Getting Started With Your Professional Journey, Data Visualization, Project Management, and Health Care Management, while others like Upload Your First Year Transcript and Slide Geeks: A Presentation Design Competition are less frequent. The data is consistent and requires no further cleaning, enabling accurate analysis of learner participation across different opportunities.

Exporting the corrected "Cleaned_Preprocessed_Dataset.csv" as "Cleaned_Processed_Dataset.csv"

In [20]:
# Export the cleaned dataset
df.to_csv("Cleaned_Processed_Dataset.csv", index=False)

Viewing First few rows

In [21]:
df = pd.read_csv("Cleaned_Processed_Dataset.csv")

from IPython.display import display
display(df.head(5))

Unnamed: 0,Learner SignUp DateTime,Opportunity Id,Opportunity Name,Opportunity Category,Opportunity End Date,First Name,Date of Birth,Gender,Country,Institution Name,Current/Intended Major,Entry created at,Status Description,Status Code,Apply Date,Opportunity Start Date
0,2023-06-14 12:30:35,00000000-0GN2-A0AY-7XK8-C5FZPP,Career Essentials: Getting Started With Your P...,Course,2024-06-29 18:52:39,Faria,01/12/2001,Female,Pakistan,Northwest Institute Of Health Sciences,Radiology,2024-03-11 12:01:41,Started,1080,2023-06-14 12:36:09,2022-11-03 18:30:39
1,2023-05-01 05:29:16,00000000-0GN2-A0AY-7XK8-C5FZPP,Career Essentials: Getting Started With Your P...,Course,2024-06-29 18:52:39,Poojitha,08/16/2000,Female,India,Saint Louis University,Information Systems,2024-03-11 12:01:41,Started,1080,2023-05-01 06:08:21,2022-11-03 18:30:39
2,2023-04-09 20:35:08,00000000-0GN2-A0AY-7XK8-C5FZPP,Career Essentials: Getting Started With Your P...,Course,2024-06-29 18:52:39,Emmanuel,01/27/2002,Male,United States,Illinois Institute Of Technology,Computer Science,2024-03-11 12:01:41,Started,1080,1900-01-01 00:00:00,2022-11-03 18:30:39
3,2023-08-29 05:20:03,00000000-0GN2-A0AY-7XK8-C5FZPP,Career Essentials: Getting Started With Your P...,Course,2024-06-29 18:52:39,Amrutha Varshini,11/01/1999,Female,United States,Saint Louis University,Information Systems,2024-03-11 12:01:41,Team Allocated,1070,2023-10-09 22:02:42,2022-11-03 18:30:39
4,2023-01-06 15:26:36,00000000-0GN2-A0AY-7XK8-C5FZPP,Career Essentials: Getting Started With Your P...,Course,2024-06-29 18:52:39,Vinay Varshith,04/19/2000,Male,United States,Saint Louis University,Computer Science,2024-03-11 12:01:41,Started,1080,2023-01-06 15:40:10,2022-11-03 18:30:39
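
Row 2 of the preview shows an Apply Date of 1900-01-01 00:00:00, which looks like a placeholder rather than a real application date (this is an inference from the preview, not something the notebook states). A minimal sketch for counting such dates follows; the helper `count_dates_before`, the cutoff of 2000-01-01, and the toy frame are all assumptions.

```python
import pandas as pd

def count_dates_before(frame, date_cols, earliest="2000-01-01"):
    """Count, per column, the dates that fall before a plausibility cutoff."""
    cutoff = pd.Timestamp(earliest)
    parsed = frame[date_cols].apply(pd.to_datetime, errors="coerce")
    return (parsed < cutoff).sum()

# Toy rows modelled on the preview, not the full dataset
toy = pd.DataFrame({
    "Apply Date": ["2023-06-14 12:36:09", "1900-01-01 00:00:00"],
    "Opportunity Start Date": ["2022-11-03 18:30:39", "2022-11-03 18:30:39"],
})
suspicious = count_dates_before(toy, ["Apply Date", "Opportunity Start Date"])
print(suspicious)
```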


Now adding feature columns to "Cleaned_Processed_Dataset.csv"

After the feature columns were added, the file was renamed "Processed_Dataset.csv"

Viewing First Few Rows

In [2]:
df = pd.read_csv(r"D:\Excelerate AI Data Internship\Processed_Dataset.csv", encoding='latin1')

from IPython.display import display
display(df.head(5))

Unnamed: 0,Learner SignUp DateTime,Opportunity Id,Opportunity Name,Opportunity Category,Opportunity End Date,First Name,Date of Birth,Gender,Country,Institution Name,...,Encoded Opportunity Category,Encoded Country,Extracted SignUp month,Extracted SignUp Year,Extracted SignUp Day,Weekly Patterns,Seasonal Patterns,Engagement Days,Duration × Age,Engagement Score
0,6/14/2023 12:30,00000000-0GN2-A0AY-7XK8-C5FZPP,Career Essentials: Getting Started With Your P...,Course,6/29/2024 18:52,Faria,1/12/2001,Female,Pakistan,Northwest Institute Of Health Sciences,...,2,5,6,2023,14,Wed,Summer,222.753819,14496.36667,1521.262813
1,5/1/2023 5:29,00000000-0GN2-A0AY-7XK8-C5FZPP,Career Essentials: Getting Started With Your P...,Course,6/29/2024 18:52,Poojitha,8/16/2000,Female,India,Saint Louis University,...,2,2,5,2023,1,Mon,Spring,178.484514,15100.38194,1568.583549
2,4/9/2023 20:35,00000000-0GN2-A0AY-7XK8-C5FZPP,Career Essentials: Getting Started With Your P...,Course,6/29/2024 18:52,Emmanuel,1/27/2002,Male,United States,Illinois Institute Of Technology,...,2,1,4,2023,9,Sun,Spring,-44867.77128,13892.35139,-12066.49625
3,8/29/2023 5:20,00000000-0GN2-A0AY-7XK8-C5FZPP,Career Essentials: Getting Started With Your P...,Course,6/29/2024 18:52,Amrutha Varshini,11/1/1999,Female,United States,Saint Louis University,...,2,1,8,2023,29,Tue,Summer,340.147257,15100.38194,1617.082372
4,1/6/2023 15:26,00000000-0GN2-A0AY-7XK8-C5FZPP,Career Essentials: Getting Started With Your P...,Course,6/29/2024 18:52,Vinay Varshith,4/19/2000,Male,United States,Saint Louis University,...,2,1,1,2023,6,Fri,Winter,63.881609,15100.38194,1534.202677


Viewing All column Names

In [3]:
print(df.columns.tolist())

['Learner SignUp DateTime', 'Opportunity Id', 'Opportunity Name', 'Opportunity Category', 'Opportunity End Date', 'First Name', 'Date of Birth', 'Gender', 'Country', 'Institution Name', 'Current/Intended Major', 'Entry created at', 'Status Description', 'Status Code', 'Apply Date', 'Opportunity Start Date', 'Age ', 'Opportunity Duration', 'Normalized Age', 'Normalized Status Code', 'Normalized Opportunity Duration', 'Encoded Gender', 'Encoded Opportunity Category', 'Encoded Country', 'Extracted SignUp month', 'Extracted SignUp Year', 'Extracted SignUp Day', 'Weekly Patterns', 'Seasonal Patterns', 'Engagement Days', 'Duration × Age', 'Engagement Score']


Verifying whether any missing values remain

In [4]:
missing_values = df.isnull().sum()
print(missing_values)

Learner SignUp DateTime            0
Opportunity Id                     0
Opportunity Name                   0
Opportunity Category               0
Opportunity End Date               0
First Name                         0
Date of Birth                      0
Gender                             0
Country                            0
Institution Name                   0
Current/Intended Major             0
Entry created at                   0
Status Description                 0
Status Code                        0
Apply Date                         0
Opportunity Start Date             0
Age                                0
Opportunity Duration               0
Normalized Age                     0
Normalized Status Code             0
Normalized Opportunity Duration    0
Encoded Gender                     0
Encoded Opportunity Category       0
Encoded Country                    0
Extracted SignUp month             0
Extracted SignUp Year              0
Extracted SignUp Day               0
W

Total Rows & columns

In [5]:
rows, columns = df.shape
print(f"Total rows: {rows}")
print(f"Total columns: {columns}")

Total rows: 8246
Total columns: 32


Checking whether there are any negative values

In [8]:
# Check for negative values in numeric columns
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()

for col in numeric_cols:
    neg_count = (df[col] < 0).sum()
    if neg_count > 0:
        print(f"{col}: {neg_count} negative values")
    else:
        print(f"{col}: 0 negative values")


Status Code: 0 negative values
Age : 0 negative values
Opportunity Duration: 509 negative values
Normalized Age: 87 negative values
Normalized Status Code: 3550 negative values
Normalized Opportunity Duration: 0 negative values
Encoded Gender: 0 negative values
Encoded Opportunity Category: 0 negative values
Encoded Country: 0 negative values
Extracted SignUp month: 0 negative values
Extracted SignUp Year: 0 negative values
Extracted SignUp Day: 0 negative values
Engagement Days: 1964 negative values
Duration × Age: 509 negative values
Engagement Score: 596 negative values


Capping the outliers using the IQR method (winsorization)

In [9]:
# Handle outliers without dropping rows or creating NaNs
def cap_outliers_iqr(data, col):
    """Cap `col` to the IQR bounds, then floor it at zero (modifies `data` in place)."""
    Q1 = data[col].quantile(0.25)
    Q3 = data[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Replace values outside the bounds with the nearest bound
    data[col] = data[col].clip(lower=lower_bound, upper=upper_bound)

    # Ensure no negative values remain
    data[col] = data[col].clip(lower=0)

    return data

# Columns that had negative values
cols_with_negatives = [
    "Opportunity Duration",
    "Normalized Age",
    "Normalized Status Code",
    "Engagement Days",
    "Duration × Age",
    "Engagement Score"
]

# Apply only to these columns
for col in cols_with_negatives:
    df = cap_outliers_iqr(df, col)

# Double check for any NaN or negative values in these columns
print("Any NaN left? ", df.isnull().values.any())
for col in cols_with_negatives:
    neg_count = (df[col] < 0).sum()
    print(f"{col}: {neg_count} negative values")


Any NaN left?  False
Opportunity Duration: 0 negative values
Normalized Age: 0 negative values
Normalized Status Code: 0 negative values
Engagement Days: 0 negative values
Duration × Age: 0 negative values
Engagement Score: 0 negative values


Note: Outliers in the processed dataset were handled without discarding any records. For the six columns that contained negative values, we applied the Interquartile Range (IQR) method: values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR were capped to the nearest bound instead of being removed. This capping, a form of winsorization, keeps every row while limiting extreme deviations. The same six columns were then clipped at zero so that no negative values remain, and a final check confirmed that no NaN or missing entries were present. The cleaned and verified dataset was then exported as Verified_Processed_Dataset.csv for further analysis.
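
A worked example of the two steps on a toy series may help separate their effects: IQR capping alone can still leave a negative lower bound, and the clip at zero is what removes the remaining negatives. The function name `cap_iqr` and the toy values are assumptions for illustration.

```python
import pandas as pd

def cap_iqr(series, k=1.5):
    """Cap values to [Q1 - k*IQR, Q3 + k*IQR] without dropping any rows."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Toy values, not taken from the dataset
toy = pd.Series([-20.0, 1.0, 2.0, 3.0, 4.0, 100.0])

capped = cap_iqr(toy)            # step 1: cap to the IQR bounds
floored = capped.clip(lower=0)   # step 2: remove remaining negatives

print(capped.tolist())
print(floored.tolist())
```

Here Q1 = 1.25 and Q3 = 3.75, so the bounds are -2.5 and 7.5: the value -20 is capped to -2.5 in the first step and only becomes 0 in the second.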

Verifying again Total Rows & columns

In [10]:
rows, columns = df.shape
print(f"Total rows: {rows}")
print(f"Total columns: {columns}")

Total rows: 8246
Total columns: 32


Verifying whether any missing values remain

In [11]:
missing_values = df.isnull().sum()
print(missing_values)

Learner SignUp DateTime            0
Opportunity Id                     0
Opportunity Name                   0
Opportunity Category               0
Opportunity End Date               0
First Name                         0
Date of Birth                      0
Gender                             0
Country                            0
Institution Name                   0
Current/Intended Major             0
Entry created at                   0
Status Description                 0
Status Code                        0
Apply Date                         0
Opportunity Start Date             0
Age                                0
Opportunity Duration               0
Normalized Age                     0
Normalized Status Code             0
Normalized Opportunity Duration    0
Encoded Gender                     0
Encoded Opportunity Category       0
Encoded Country                    0
Extracted SignUp month             0
Extracted SignUp Year              0
Extracted SignUp Day               0
W

Checking whether any negative values remain

In [12]:
# Check for negative values in numeric columns
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()

for col in numeric_cols:
    neg_count = (df[col] < 0).sum()
    if neg_count > 0:
        print(f"{col}: {neg_count} negative values")
    else:
        print(f"{col}: 0 negative values")


Status Code: 0 negative values
Age : 0 negative values
Opportunity Duration: 0 negative values
Normalized Age: 0 negative values
Normalized Status Code: 0 negative values
Normalized Opportunity Duration: 0 negative values
Encoded Gender: 0 negative values
Encoded Opportunity Category: 0 negative values
Encoded Country: 0 negative values
Extracted SignUp month: 0 negative values
Extracted SignUp Year: 0 negative values
Extracted SignUp Day: 0 negative values
Engagement Days: 0 negative values
Duration × Age: 0 negative values
Engagement Score: 0 negative values


Exporting as "Verified_Processed_Dataset.csv"

In [13]:
# Export the cleaned dataset
output_path = r"D:\Excelerate AI Data Internship\Verified_Processed_Dataset.csv"
df.to_csv(output_path, index=False, encoding='utf-8')

print("Cleaned dataset has been exported successfully to:", output_path)


Cleaned dataset has been exported successfully to: D:\Excelerate AI Data Internship\Verified_Processed_Dataset.csv
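
A round-trip read is one way to confirm that an export preserved the data. The sketch below writes a toy frame to a temporary file and reads it back; the file name and toy values are assumptions, and the same comparison could be run against the real `output_path`.

```python
import os
import tempfile

import pandas as pd

# Toy frame with a non-ASCII column name, as in the processed dataset
toy = pd.DataFrame({
    "Status Code": [1080, 1070],
    "Duration × Age": [14496.37, 15100.38],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_export.csv")  # hypothetical file name
    toy.to_csv(path, index=False, encoding="utf-8")
    reloaded = pd.read_csv(path, encoding="utf-8")

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
no_missing = not reloaded.isnull().values.any()
print(same_shape, same_columns, no_missing)
```

Because the export uses `encoding='utf-8'`, the file should be read back with the same encoding; reading it as `latin1`, as was done for Processed_Dataset.csv, would garble the `×` in `Duration × Age`.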


# Dataset is Verified!