The Head Data Scientist at Training Data Ltd. has asked you to create a DataFrame called ds_jobs_transformed that stores the data in customer_train.csv much more efficiently. Specifically, they have set the following requirements:


* Columns containing categories with only two factors must be stored as Booleans (bool).
* Columns containing integers only must be stored as 32-bit integers (int32).
* Columns containing floats must be stored as 16-bit floats (float16).
* Columns containing nominal categorical data must be stored as the category data type.
* Columns containing ordinal categorical data must be stored as ordered categories, and not mapped to numerical values, with an order that reflects the natural order of the column.
* The DataFrame should be filtered to only contain students with 10 or more years of experience at companies with at least 1000 employees, as their recruiter base is suited to more experienced professionals at enterprise companies.
* If you call .info() or .memory_usage() methods on ds_jobs and ds_jobs_transformed after you've preprocessed it, you should notice a substantial decrease in memory usage.
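
As a rough illustration of the last requirement, the sketch below compares memory usage before and after downcasting. It runs on a small made-up DataFrame, not on customer_train.csv: the column names are borrowed from the dataset, but the values and the `toy` and `toy_small` names are assumptions for demonstration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data (values are randomly generated, not from the CSV)
rng = np.random.default_rng(0)
n = 10_000
toy = pd.DataFrame({
    "training_hours": rng.integers(1, 300, size=n),                      # int64
    "city_development_index": rng.random(n),                             # float64
    "major_discipline": rng.choice(["STEM", "Arts", "Other"], size=n),   # strings
})

# Apply the smaller data types asked for in the requirements
toy_small = toy.astype({
    "training_hours": "int32",
    "city_development_index": "float16",
    "major_discipline": "category",
})

# deep=True also counts the memory held by the string values themselves
before = toy.memory_usage(deep=True).sum()
after = toy_small.memory_usage(deep=True).sum()
print(f"before: {before:,} bytes, after: {after:,} bytes")
```

The same two `.memory_usage(deep=True).sum()` calls can be made on ds_jobs and ds_jobs_transformed once the preprocessing is finished.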


A common problem when creating models to generate business value from data is that the datasets can be so large that it can take days for the model to generate predictions. Ensuring that your dataset is stored as efficiently as possible is crucial for allowing these models to run on a more reasonable timescale without having to reduce the size of the dataset.

You've been hired by a major online data science training provider called Training Data Ltd. to clean up one of their largest customer datasets. This dataset will eventually be used to predict whether their students are looking for a new job or not, information that they will then use to direct them to prospective recruiters.

In [35]:
import pandas as pd

# Load the dataset
ds_jobs = pd.read_csv("/kaggle/input/customer-train/customer_train.csv")

# View the dataset
ds_jobs.head()

Unnamed: 0,student_id,city,city_development_index,gender,relevant_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,job_change
0,8949,city_103,0.92,Male,Has relevant experience,no_enrollment,Graduate,STEM,>20,,,1,36,1
1,29725,city_40,0.776,Male,No relevant experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0
2,11561,city_21,0.624,,No relevant experience,Full time course,Graduate,STEM,5,,,never,83,0
3,33241,city_115,0.789,,No relevant experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1
4,666,city_162,0.767,Male,Has relevant experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0


In [36]:
# Create a copy of ds_jobs for transforming
ds_jobs_transformed = ds_jobs.copy()

# Exploratory data analysis

Load customer_train.csv and explore the data to understand the contents and data types of the values in each column.


How to find the data types and contents of a column
* You can check the column names and assigned data types by calling .info() on the DataFrame.
* The .value_counts() method can be used to view the unique values and counts present in a column.

How to determine what a column's data type should be
* Columns containing only two unique values with yes/no style values (two-factor categories) should be converted to the bool data type.
* Columns containing a small number of unique values with no natural ordering should be set to the category data type, as they contain nominal categorical data.
* Columns containing categorical data with a natural ordering should be converted to ordered categories, as they contain ordinal categorical data.
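
The rules of thumb above can be turned into a small helper for a first pass over the columns. This is only a sketch: `suggest_dtype` and its `max_categories` threshold are hypothetical names, not part of pandas, and whether a categorical column is ordinal still has to be decided by reading its values. The helper also ignores missing values, which a bool column cannot hold.

```python
import pandas as pd

def suggest_dtype(series: pd.Series, max_categories: int = 150) -> str:
    """Hypothetical helper: propose a storage type from a column's contents."""
    n_unique = series.nunique(dropna=True)
    if n_unique == 2:
        return "bool"                      # two-factor category
    if pd.api.types.is_integer_dtype(series):
        return "int32"
    if pd.api.types.is_float_dtype(series):
        return "float16"
    if n_unique <= max_categories:
        return "category"                  # nominal or ordinal: check by eye
    return "leave as-is"

# Toy rows loosely modelled on the dataset (values are illustrative)
toy = pd.DataFrame({
    "job_change": [0, 1, 0, 1],
    "training_hours": [36, 47, 83, 52],
    "city_development_index": [0.92, 0.776, 0.624, 0.789],
    "education_level": ["Graduate", "Masters", "Graduate", "Phd"],
})

suggestions = {col: suggest_dtype(toy[col]) for col in toy.columns}
print(suggestions)
```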

In [37]:
# find the data types and contents of a column
ds_jobs_transformed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19158 entries, 0 to 19157
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   student_id              19158 non-null  int64  
 1   city                    19158 non-null  object 
 2   city_development_index  19158 non-null  float64
 3   gender                  14650 non-null  object 
 4   relevant_experience     19158 non-null  object 
 5   enrolled_university     18772 non-null  object 
 6   education_level         18698 non-null  object 
 7   major_discipline        16345 non-null  object 
 8   experience              19093 non-null  object 
 9   company_size            13220 non-null  object 
 10  company_type            13018 non-null  object 
 11  last_new_job            18735 non-null  object 
 12  training_hours          19158 non-null  int64  
 13  job_change              19158 non-null  int64  
dtypes: float64(1), int64(3), object(10)
me

In [38]:
# Print the unique values and their counts for every text (object) column
for col in ds_jobs_transformed.select_dtypes("object").columns:
    print(ds_jobs_transformed[col].value_counts(), '\n')

city
city_103    4355
city_21     2702
city_16     1533
city_114    1336
city_160     845
            ... 
city_129       3
city_111       3
city_121       3
city_140       1
city_171       1
Name: count, Length: 123, dtype: int64 

gender
Male      13221
Female     1238
Other       191
Name: count, dtype: int64 

relevant_experience
Has relevant experience    13792
No relevant experience      5366
Name: count, dtype: int64 

enrolled_university
no_enrollment       13817
Full time course     3757
Part time course     1198
Name: count, dtype: int64 

education_level
Graduate          11598
Masters            4361
High School        2017
Phd                 414
Primary School      308
Name: count, dtype: int64 

major_discipline
STEM               14492
Humanities           669
Other                381
Business Degree      327
Arts                 253
No Major             223
Name: count, dtype: int64 

experience
>20    3286
5      1430
4      1403
3      1354
6      1216
2      1127
7 

# Converting integers, floats, and unordered categories

Convert columns containing integers to the int32 type, floats to the float16 type, nominal categories to the category type, and two-factor categories to the bool type.

**Columns containing integers must be stored as 32-bit integers (int32)**

In [39]:
int_variables = ['student_id', 'training_hours', 'job_change']
print(ds_jobs_transformed[int_variables].dtypes)
ds_jobs_transformed[int_variables] = ds_jobs_transformed[int_variables].astype(dtype='int32')
print(ds_jobs_transformed[int_variables].dtypes)

student_id        int64
training_hours    int64
job_change        int64
dtype: object
student_id        int32
training_hours    int32
job_change        int32
dtype: object
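
Downcasting to int32 is only safe when every value fits in 32 bits. The sketch below shows one way to check that before converting; `fits_in_int32` is a hypothetical helper and the two example Series are made up.

```python
import numpy as np
import pandas as pd

def fits_in_int32(series: pd.Series) -> bool:
    """Hypothetical helper: True if every value fits in a signed 32-bit integer."""
    limits = np.iinfo(np.int32)
    return bool(series.min() >= limits.min and series.max() <= limits.max)

small_ids = pd.Series([8949, 29725, 11561], dtype="int64")
too_big = pd.Series([2**31], dtype="int64")  # one past the int32 maximum

print(fits_in_int32(small_ids))  # True
print(fits_in_int32(too_big))    # False
```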


**Columns containing floats must be stored as 16-bit floats (float16)**

In [40]:
float_variables = ['city_development_index']
print(ds_jobs_transformed[float_variables].dtypes)
ds_jobs_transformed[float_variables] = ds_jobs_transformed[float_variables].astype(dtype='float16')
print(ds_jobs_transformed[float_variables].dtypes)

city_development_index    float64
dtype: object
city_development_index    float16
dtype: object
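
Note that float16 keeps only about three significant decimal digits, so values are rounded to the nearest representable number. The short sketch below shows the effect on the value 0.92 from the first row of the data, which is why it appears as 0.919922 in later printed output.

```python
import numpy as np

original = 0.92
stored = np.float16(original)

print(float(stored))                     # 0.919921875
print(abs(float(stored) - original))     # rounding error, well under 0.001
print(float(np.finfo(np.float16).max))   # 65504.0, the largest finite float16
```

For an index that lies between 0 and 1 this loss of precision is probably acceptable, but it is worth checking before applying float16 to other columns.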


**Columns containing nominal categorical data must be stored as the category data type**

In [41]:
object_variables = [col for col in ds_jobs_transformed.columns if ds_jobs_transformed[col].dtype == "O"]
print(object_variables)
print(ds_jobs_transformed[object_variables].dtypes)

['city', 'gender', 'relevant_experience', 'enrolled_university', 'education_level', 'major_discipline', 'experience', 'company_size', 'company_type', 'last_new_job']
city                   object
gender                 object
relevant_experience    object
enrolled_university    object
education_level        object
major_discipline       object
experience             object
company_size           object
company_type           object
last_new_job           object
dtype: object


In [42]:
ds_jobs_transformed[object_variables] = ds_jobs_transformed[object_variables].astype(dtype='category')
print(ds_jobs_transformed[object_variables].dtypes)

city                   category
gender                 category
relevant_experience    category
enrolled_university    category
education_level        category
major_discipline       category
experience             category
company_size           category
company_type           category
last_new_job           category
dtype: object
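
To see why the category type saves memory, it helps to look at what it stores: each unique label once, plus a small integer code per row. The sketch below uses a made-up five-row Series, not the real city column.

```python
import pandas as pd

# Toy column with a repeated label and one missing value
cities = pd.Series(["city_103", "city_40", "city_103", None, "city_21"])
as_cat = cities.astype("category")

# Each unique label is stored once (sorted alphabetically by default)
print(as_cat.cat.categories.tolist())  # ['city_103', 'city_21', 'city_40']

# Every row holds only an integer code; -1 marks a missing value
print(as_cat.cat.codes.tolist())       # [0, 2, 0, -1, 1]
```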


# Converting ordered categories

Convert columns containing ordinal categorical data into ordered categories.

**Columns containing ordinal categorical data must be stored as ordered categories, and not mapped to numerical values, with an order that reflects the natural order of the column**

In [43]:
object_variables = [col for col in ds_jobs_transformed.columns if ds_jobs_transformed[col].dtype == 'category']
for col in object_variables:
    print(f"_{col.upper()}_".center(30, '*'))
    print(f'{col}: {ds_jobs_transformed[col].unique()}', "\n")

************_CITY_************
city: ['city_103', 'city_40', 'city_21', 'city_115', 'city_162', ..., 'city_121', 'city_129', 'city_8', 'city_31', 'city_171']
Length: 123
Categories (123, object): ['city_1', 'city_10', 'city_100', 'city_101', ..., 'city_94', 'city_97', 'city_98', 'city_99'] 

***********_GENDER_***********
gender: ['Male', NaN, 'Female', 'Other']
Categories (3, object): ['Female', 'Male', 'Other'] 

****_RELEVANT_EXPERIENCE_*****
relevant_experience: ['Has relevant experience', 'No relevant experience']
Categories (2, object): ['Has relevant experience', 'No relevant experience'] 

****_ENROLLED_UNIVERSITY_*****
enrolled_university: ['no_enrollment', 'Full time course', NaN, 'Part time course']
Categories (3, object): ['Full time course', 'Part time course', 'no_enrollment'] 

******_EDUCATION_LEVEL_*******
education_level: ['Graduate', 'Masters', 'High School', NaN, 'Phd', 'Primary School']
Categories (5, object): ['Graduate', 'High School', 'Masters', 'Phd', 'Primary 


Creating an ordered list of work experience values
* In ascending order, the work experience values should be <1, then the numbers one through twenty, then >20.
* There are lots of ways to create this list:
    * You could use the .unique() and .sort() methods on the column, then convert to a list and rearrange to ensure <1 and >20 are in the correct position.
    * Create a list of numbers 1-20, map the numbers to strings, and concatenate ["<1"] and [">20"] to the beginning and end, respectively.
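
Both approaches can be sketched as follows. The `experience` Series here is a shuffled toy column that is assumed to contain every label exactly once; the first approach sorts the numeric labels as integers (a plain string sort would put "10" before "2") and assumes that both "<1" and ">20" are present in the data.

```python
import pandas as pd

# Toy column: every experience label once, in shuffled order
labels = ["<1"] + [str(i) for i in range(1, 21)] + [">20"]
experience = pd.Series(labels).sample(frac=1, random_state=0)

# Approach 1: start from the unique values, sort the numeric labels by their
# integer value, then pin '<1' and '>20' to the two ends
unique_vals = experience.dropna().unique().tolist()
numeric = sorted((v for v in unique_vals if v.isdigit()), key=int)
order_from_data = ["<1"] + numeric + [">20"]

# Approach 2: build the list directly from a range of numbers
order_direct = ["<1"] + list(map(str, range(1, 21))) + [">20"]

print(order_from_data == order_direct)
```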

In [44]:
ordered_categories = {
    'relevant_experience': ['No relevant experience', 'Has relevant experience'],
    'enrolled_university': ['no_enrollment', 'Part time course', 'Full time course'],
    'education_level': ['Primary School', 'High School', 'Graduate', 'Masters', 'Phd'],
    'experience': ['<1'] + list(map(str, range(1, 21))) + ['>20'],
    'company_size': ['<10', '10-49', '50-99', '100-499', '500-999', '1000-4999', '5000-9999', '10000+'],
    'last_new_job': ['never', '1', '2', '3', '4', '>4']
}
# Cast each ordinal column to an ordered categorical with its natural order
for col, categories in ordered_categories.items():
    category = pd.CategoricalDtype(categories, ordered=True)
    ds_jobs_transformed[col] = ds_jobs_transformed[col].astype(category)

print(ds_jobs_transformed)

       student_id      city  city_development_index gender  \
0            8949  city_103                0.919922   Male   
1           29725   city_40                0.775879   Male   
2           11561   city_21                0.624023    NaN   
3           33241  city_115                0.789062    NaN   
4             666  city_162                0.767090   Male   
...           ...       ...                     ...    ...   
19153        7386  city_173                0.877930   Male   
19154       31398  city_103                0.919922   Male   
19155       24576  city_103                0.919922   Male   
19156        5756   city_65                0.801758   Male   
19157       23834   city_67                0.854980    NaN   

           relevant_experience enrolled_university education_level  \
0      Has relevant experience       no_enrollment        Graduate   
1       No relevant experience       no_enrollment        Graduate   
2       No relevant experience    Full time c

# Filtering on ordered categorical columns

Filter the DataFrame to only contain students with 10 or more years of experience at companies with at least 1000 employees.

In [45]:
# Filter students with 10 or more years experience at companies with at least 1000 employees.
# The comparisons follow the category order defined above, not alphabetical order.
ds_jobs_transformed = ds_jobs_transformed[
    (ds_jobs_transformed['experience'] >= '10')
    & (ds_jobs_transformed['company_size'] >= '1000-4999')
]

print(ds_jobs_transformed)

       student_id      city  city_development_index  gender  \
9             699  city_103                0.919922     NaN   
12          25619   city_61                0.913086    Male   
31          22293  city_103                0.919922    Male   
34          26494   city_16                0.910156    Male   
40           2547  city_114                0.925781  Female   
...           ...       ...                     ...     ...   
19097       25447  city_103                0.919922    Male   
19101        6803   city_16                0.910156    Male   
19103       32932   city_10                0.895020    Male   
19128        3365   city_16                0.910156     NaN   
19143       33047  city_103                0.919922    Male   

           relevant_experience enrolled_university education_level  \
9      Has relevant experience       no_enrollment        Graduate   
12     Has relevant experience       no_enrollment        Graduate   
31     Has relevant experience   
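
The filter above only works because the columns are ordered categories. The sketch below, on a made-up five-row Series, contrasts the comparison on an ordered categorical with the same comparison on plain Python strings, which is alphabetical and gives the wrong answer for labels such as '50-99' and '<10'.

```python
import pandas as pd

sizes = ['<10', '10-49', '50-99', '100-499', '500-999',
         '1000-4999', '5000-9999', '10000+']
size_dtype = pd.CategoricalDtype(sizes, ordered=True)

# Toy column including one missing value
toy = pd.Series(['50-99', '10000+', None, '1000-4999', '<10'])
as_ordered = toy.astype(size_dtype)

# Ordered categorical: compared by position in `sizes`; missing values give False
mask_ordered = as_ordered >= '1000-4999'
print(mask_ordered.tolist())  # [False, True, False, True, False]

# Plain strings: compared character by character, so small companies pass too
mask_strings = [s >= '1000-4999' for s in ['50-99', '<10']]
print(mask_strings)           # [True, True]
```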

# Converting two-factor categories to the bool type

In [47]:
print(ds_jobs_transformed["relevant_experience"])

9        Has relevant experience
12       Has relevant experience
31       Has relevant experience
34       Has relevant experience
40       Has relevant experience
                  ...           
19097    Has relevant experience
19101    Has relevant experience
19103    Has relevant experience
19128    Has relevant experience
19143    Has relevant experience
Name: relevant_experience, Length: 2201, dtype: category
Categories (2, object): ['No relevant experience' < 'Has relevant experience']


In [48]:
# Create a mapping dictionary of columns containing two-factor categories to convert to Booleans
two_factor_cats = {
    'relevant_experience': {'No relevant experience': False, 'Has relevant experience': True},
    'job_change': {0.0: False, 1.0: True}
}

In [49]:
# Loop through DataFrame columns to efficiently change data types
for col in ds_jobs_transformed:

    # Convert two-factor categories to bool.
    # Note: .map() on a categorical column returns a categorical, so
    # relevant_experience needs a further .astype(bool) to reach the plain bool dtype.
    if col in two_factor_cats:
        ds_jobs_transformed[col] = ds_jobs_transformed[col].map(two_factor_cats[col])

print(ds_jobs_transformed["relevant_experience"])

9        True
12       True
31       True
34       True
40       True
         ... 
19097    True
19101    True
19103    True
19128    True
19143    True
Name: relevant_experience, Length: 2201, dtype: category
Categories (2, bool): [False < True]
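
Notice that the output above still reports dtype: category (with Boolean categories) rather than bool, because mapping a categorical column produces another categorical. The sketch below shows two possible ways to reach a plain bool dtype on a made-up three-row column. It assumes the column has no missing values, which matches the 19158 non-null entries reported by .info() for relevant_experience.

```python
import pandas as pd

# Toy ordered categorical column with the same two labels as the dataset
dtype = pd.CategoricalDtype(
    ['No relevant experience', 'Has relevant experience'], ordered=True
)
col = pd.Series(
    ['Has relevant experience', 'No relevant experience', 'Has relevant experience'],
    dtype=dtype,
)
mapping = {'No relevant experience': False, 'Has relevant experience': True}

# Option 1: map as before, then cast the result to bool
as_bool = col.map(mapping).astype(bool)
print(as_bool.dtype)  # bool

# Option 2: compare against the "True" label, which yields bool directly
as_bool_alt = col == 'Has relevant experience'
print(as_bool_alt.dtype)  # bool
```

On the real data, the equivalent would be appending .astype(bool) to the mapped relevant_experience column.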


In [50]:
print(ds_jobs_transformed["job_change"])

9        False
12       False
31       False
34       False
40       False
         ...  
19097    False
19101    False
19103    False
19128    False
19143    False
Name: job_change, Length: 2201, dtype: bool
