# BOUN AI Task 2

## Nilüfer Çetin

This task applies a set of preprocessing operations to a dataset obtained from [Kaggle](https://www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists). Normalization is used to bring the numerical attributes into a suitable range, whereas one-hot encoding transforms the categorical variables into a form that is easier to use for statistical learning.

Necessary packages are imported,

In [45]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

First the dataset is read and printed,

In [46]:
data = pd.read_csv("C:/Users/ŞAHİN ÇETİN/Desktop/aug_train.csv", engine='python')
print(data)

       enrollee_id      city  city_development_index gender  \
0             8949  city_103                   0.920   Male   
1            29725   city_40                   0.776   Male   
2            11561   city_21                   0.624    NaN   
3            33241  city_115                   0.789    NaN   
4              666  city_162                   0.767   Male   
...            ...       ...                     ...    ...   
19153         7386  city_173                   0.878   Male   
19154        31398  city_103                   0.920   Male   
19155        24576  city_103                   0.920   Male   
19156         5756   city_65                   0.802   Male   
19157        23834   city_67                   0.855    NaN   

           relevent_experience enrolled_university education_level  \
0      Has relevent experience       no_enrollment        Graduate   
1       No relevent experience       no_enrollment        Graduate   
2       No relevent experience   

The rows containing a NaN value (a missing observation) are dropped from the set,

In [47]:
data = data.dropna()
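
Before dropping rows it can be useful to see how many values are missing in each column. The following is a minimal sketch on a made-up toy frame (the values and the name `toy` are illustrative assumptions, not taken from the dataset); it also shows that `dropna()` keeps the original index labels, which is why the printed index below has gaps.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "enrollee_id": [1, 2, 3, 4],
    "gender": ["Male", np.nan, "Female", "Male"],
    "company_size": ["50-99", "<10", np.nan, "100-500"],
})

# Number of missing values per column, before anything is dropped
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() keeps only the rows without any NaN; the original index labels survive
complete = toy.dropna()
print(len(toy), "->", len(complete))

# reset_index(drop=True) renumbers the rows 0..n-1 if a contiguous index is wanted
complete = complete.reset_index(drop=True)
```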

In [48]:
print(data)

       enrollee_id      city  city_development_index  gender  \
1            29725   city_40                   0.776    Male   
4              666  city_162                   0.767    Male   
7              402   city_46                   0.762    Male   
8            27107  city_103                   0.920    Male   
11           23853  city_103                   0.920    Male   
...            ...       ...                     ...     ...   
19147        21319   city_21                   0.624    Male   
19149          251  city_103                   0.920    Male   
19150        32313  city_160                   0.920  Female   
19152        29754  city_103                   0.920  Female   
19155        24576  city_103                   0.920    Male   

           relevent_experience enrolled_university education_level  \
1       No relevent experience       no_enrollment        Graduate   
4      Has relevent experience       no_enrollment         Masters   
7      Has relevent e

# 1- Normalization for Numerical Data

For machine learning and deep learning purposes, it is important for the numerical features to lie in similar ranges. The objective functions of many learning techniques in these domains are optimized with an iterative algorithm called **gradient descent**, so the CPU time depends on the parameters and on the shape of the objective function. For gradient descent to converge in fewer iterations, the numerical features should be brought into a comparable range such as ${[-1,1]}$. Otherwise, the iterates might oscillate back and forth, increasing the run time.
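
The effect of feature scale on gradient descent can be illustrated with a toy quadratic objective. This is only a sketch under simplifying assumptions: the function `gd_iterations`, the objective and the curvature values are hypothetical and are not part of this task. With a step size small enough to stay stable in the steepest direction, the flat direction converges very slowly; a larger step would instead make the steep direction oscillate.

```python
import numpy as np

def gd_iterations(curvatures, tol=1e-6, max_iter=100_000):
    """Run gradient descent on f(w) = 0.5 * sum(c_k * w_k**2) and
    return the number of iterations until ||w|| < tol."""
    curvatures = np.asarray(curvatures, dtype=float)
    w = np.ones_like(curvatures)
    step = 1.0 / curvatures.max()   # stable step for the steepest direction
    for it in range(1, max_iter + 1):
        w -= step * curvatures * w  # the gradient of f is c_k * w_k
        if np.linalg.norm(w) < tol:
            return it
    return max_iter

# Very different curvatures mimic features on very different scales
badly_scaled = gd_iterations([1.0, 3600.0])
well_scaled = gd_iterations([1.0, 1.0])
print(badly_scaled, well_scaled)
```

In this toy setting the equally scaled problem needs far fewer iterations than the badly scaled one.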

There are several ways to normalize the numeric variables. The technique chosen in this study is to first de-mean the data by,

${\hat{x_i} = x_i - \bar{x}}$ where,

${\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}}$

Then normalizing the data by,

${\tilde{x_i} = \hat{x_i} / \sigma^2}$ where,

${\sigma^2}$ is the **variance** of the de-meaned vector.
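
As a small worked example of these two steps, the sketch below de-means a made-up vector and divides it by its variance. The numbers are illustrative only. Two points are worth keeping in mind: pandas computes the sample variance (`ddof=1`) by default, and the more common z-score divides by the standard deviation rather than the variance, which is shown here only for comparison.

```python
import numpy as np
import pandas as pd

x = pd.Series([2.0, 4.0, 6.0, 8.0])      # made-up values

x_bar = x.mean()                         # 5.0
x_hat = x - x_bar                        # de-meaned: [-3, -1, 1, 3]
sigma2 = x_hat.var()                     # sample variance (ddof=1): 20/3

x_var_scaled = x_hat / sigma2            # scaling by the variance, as described above
x_std_scaled = x_hat / np.sqrt(sigma2)   # usual z-score, for comparison

print(x_var_scaled.tolist())
print(x_std_scaled.var())                # unit variance only for the z-score
```

After dividing by the variance the result has variance ${1/\sigma^2}$, not ${1}$, so the scaled columns are centered but not on a common unit scale.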

In this dataset, the columns with numerical data are **city_development_index** and **training_hours**. As mentioned before, normalization is done to transform the numerical data into a more suitable range; **city_development_index** is already between ${0}$ and ${1}$, so it will not be normalized. Moreover, for easier representation, the **experience** and **last_new_job** columns will also be converted into numeric form. For **experience**,

${<1}$ will be ${0}$ and ${>20}$ will be ${21}$.

For **last_new_job**,

${never}$ will be ${0}$ and ${>4}$ will be ${5}$.

In [49]:
data.loc[data['experience'] == '<1', 'experience'] = 0
data.loc[data['experience'] == '>20', 'experience'] = 21
data.loc[data['last_new_job'] == 'never', 'last_new_job'] = 0
data.loc[data['last_new_job'] == '>4', 'last_new_job'] = 5

Now these columns can also be cast into a numeric type,

In [50]:
data['experience'] = pd.to_numeric(data['experience'])
data['last_new_job'] = pd.to_numeric(data['last_new_job'])
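
The same recoding can also be written with a mapping dictionary per column. The sketch below uses a made-up toy frame; the names `toy`, `experience_map` and `last_new_job_map` are illustrative assumptions. The open-ended levels are replaced by numeric strings first, so that `pd.to_numeric` converts the whole column in a single step.

```python
import pandas as pd

toy = pd.DataFrame({
    "experience": ["<1", "5", ">20", "12"],
    "last_new_job": ["never", "1", ">4", "2"],
})

# Open-ended levels and the numbers they stand for (kept as strings for now)
experience_map = {"<1": "0", ">20": "21"}
last_new_job_map = {"never": "0", ">4": "5"}

# Replace the special levels, then convert every entry to a number
toy["experience"] = pd.to_numeric(toy["experience"].replace(experience_map))
toy["last_new_job"] = pd.to_numeric(toy["last_new_job"].replace(last_new_job_map))

print(toy)
```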

In [51]:
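def normalizer(data, cols):
    """De-mean data[cols] and divide by its variance, in place.

    Returns the modified frame together with the mean and variance,
    which are needed later to revert the transformation.
    """
    mean = data[cols].mean()
    data[cols] -= mean
    var = data[cols].var()  # pandas default: sample variance (ddof=1)
    data[cols] /= var
    return data, mean, var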

In [52]:
columns = np.array(['training_hours', 'experience', 'last_new_job'])
stats = np.empty((3,2))
for i in range(0, 3):
    data, stats[i, 0], stats[i, 1] = normalizer(data, columns[i])

In [53]:
#print(data)
stats
#data['last_new_job'].mean()

array([[6.50749302e+01, 3.62826566e+03],
       [1.16356226e+01, 4.28474465e+01],
       [2.34829704e+00, 2.78475112e+00]])
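
In the array above, the first column holds the mean and the second the variance of each feature. A variant that leaves the input frame untouched and handles several columns at once could look like the sketch below. The function name `normalize_columns` and the toy values are hypothetical; this is not the code used in this task.

```python
import pandas as pd

def normalize_columns(frame, cols):
    """Return a normalized copy of frame plus the per-column mean and variance."""
    out = frame.copy()
    mean = out[cols].mean()
    var = out[cols].var()                  # sample variance (ddof=1)
    out[cols] = (out[cols] - mean) / var   # de-mean, then divide by the variance
    stats = pd.DataFrame({"mean": mean, "var": var})
    return out, stats

toy = pd.DataFrame({
    "training_hours": [10.0, 20.0, 30.0],
    "experience": [1.0, 2.0, 3.0],
})
normalized, toy_stats = normalize_columns(toy, ["training_hours", "experience"])
print(normalized)
print(toy_stats)
```

Keeping the statistics in a labeled frame makes it harder to mix up rows when the transformation is reverted later.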

For future use, the normalized feature set is saved into a numpy matrix,

In [54]:
final = (data[['city_development_index', 'training_hours', 'experience', 'last_new_job']]).values

# 2- One-Hot Encoding for Categorical Data

One-hot encoding is especially important for neural network tasks. For linear regression purposes it is common to represent categorical data by **integer encoding**, in which each integer starting from 0 or 1 represents a unique category of the feature. Although this is efficient for storage and perhaps readability, it imposes an artificial ordering, and hence a spurious correlation, between the categories, while modern neural networks rely heavily on matrix representations and multiplications. Therefore, for computational purposes and to eliminate the undesired hierarchy, **One-Hot Encoding** is preferable: the observations of a categorical feature are kept as a **sparse matrix** of zeros and ones, with ${1}$ marking the category of the observation.
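
The difference between the two encodings can be seen on a tiny made-up example. The level names and observations below are illustrative assumptions. Integer encoding gives each level a position in the sorted level list, whereas one-hot encoding turns that position into an indicator vector.

```python
import numpy as np

levels = np.array(["Graduate", "Masters", "Phd"])   # sorted, hypothetical levels
observed = np.array(["Masters", "Graduate", "Phd", "Masters"])

# Integer encoding: position of each observation in the sorted level list
integer_encoded = np.searchsorted(levels, observed)

# One-hot encoding: row k of the identity matrix is the indicator vector of level k
one_hot = np.eye(len(levels), dtype=int)[integer_encoded]

print(integer_encoded)
print(one_hot)
```

With integer encoding, "Phd" is numerically twice as far from "Graduate" as "Masters" is, an ordering the data may not justify; in the one-hot form all levels are equally distant from each other.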

For this task, the string-valued features are first mapped to integers that represent the different categories. Then, each observation is mapped to a vector via one-hot encoding. The resulting observation set is a **numpy matrix** with normalized numerical variables and one-hot encoded categorical features for the 8955 observations. This final matrix carries no column labels.

In [55]:
categoricals = np.array(['city', 'gender', 'relevent_experience', 'enrolled_university', 'education_level', 'major_discipline', 'company_size', 'company_type'])
for i in categoricals:
    values = np.array(data[i])
    # Map the string levels to integers 0..k-1 (levels are sorted alphabetically)
    label_encoder = LabelEncoder()
    integer_encoded = label_encoder.fit_transform(values)
    # Dense output; on scikit-learn >= 1.2 this argument is named sparse_output
    onehot_encoder = OneHotEncoder(sparse=False)
    integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
    # Append the one-hot columns of this feature to the right of the matrix
    final = np.hstack((final, onehot_encoder.fit_transform(integer_encoded)))

In [56]:
final.shape

(8955, 151)

The labels of the various features will be kept in another array, which will be useful for the next section. It should be kept in mind that when **scikit-learn** performs the encoding, the categories/levels of each feature are sorted alphabetically. Therefore, the same sorting is applied while constructing the labels array, so that every label lines up with its column.

In [57]:
columns = np.array(['city_development_index', 'training_hours', 'experience', 'last_new_job'])

for i in categoricals:
    uniques = np.array(sorted(data[i].unique()))
    columns = np.concatenate((columns, uniques))
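
As a cross-check of the label ordering, `OneHotEncoder` can also be fitted directly on string columns, in which case its `categories_` attribute lists the levels of each column in the order of the output columns. The sketch below uses a made-up toy frame (the name `toy` and its values are assumptions) and compares that attribute with the manually sorted labels.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "company_type": ["Pvt Ltd", "Funded Startup", "Pvt Ltd"],
})

encoder = OneHotEncoder()                        # default output is a SciPy sparse matrix
encoded = encoder.fit_transform(toy).toarray()   # densified for this small example

# categories_ holds one sorted array of levels per input column
labels = np.concatenate(encoder.categories_)
manual = np.concatenate([sorted(toy[c].unique()) for c in toy.columns])

print(labels)
print(encoded)
```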

# 3- Reverting the Data Back

For further use, for example when the target variable is a string instead of an integer, the variables need to be turned back into the form of the initial observations.

## a- One Hot Encoded Data

With this type of encoding, given the correctly sorted labels and the matrix of ${0}$s and ${1}$s, the process is fairly easy. If only the columns of categorical data are selected from the final numpy matrix, there is the **same** number of ${1}$s in each row, since each observation has the **same** number of properties. Then, if each ${1}$ is replaced by the label of its column and all ${0}$s are deleted, the initial form of the categorical data is recovered. For this purpose, a 2D array to be filled with the categorical features is initialized,

In [58]:
revert = np.empty((8955, 147), dtype="<U32")

Each row of the array is filled with the help of string-integer multiplication: a label multiplied by ${1}$ stays as it is, whereas a label multiplied by ${0}$ becomes an empty string,

In [59]:
for i in range(0, len(final)):
    for j in range(4, 151):
        revert[i, j-4] = int(final[i, j]) * str(columns[j])

To delete the empty entries, which mean that the observation does not fall under that category, the array is flattened from 2D to 1D. After the removal, the array is reshaped so that each row holds the categorical features of one observation,

In [60]:
revert = revert.ravel()
revert = np.delete(revert, np.argwhere(revert == ''))
revert = np.reshape(revert, (8955,8))
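
An alternative that avoids building the intermediate string array is to locate the ${1}$ inside each feature's block of columns with `argmax`. The sketch below is not the method used in this task; the labels, block sizes and matrix are made up for illustration.

```python
import numpy as np

# Hypothetical one-hot matrix for two features: 2 gender levels, then 3 education levels
labels = np.array(["Female", "Male", "Graduate", "Masters", "Phd"])
block_sizes = [2, 3]
one_hot = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 1, 0, 1, 0],
])

# Split the columns (and their labels) at the feature boundaries
boundaries = np.cumsum(block_sizes)[:-1]
blocks = np.split(one_hot, boundaries, axis=1)
label_blocks = np.split(labels, boundaries)

# In each block, argmax gives the position of the 1, which indexes the block's labels
decoded = np.column_stack(
    [lab[blk.argmax(axis=1)] for blk, lab in zip(blocks, label_blocks)]
)
print(decoded)
```

Both approaches rely on every row containing exactly one ${1}$ per feature.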

In [146]:
revert

array([['city_40', 'Male', 'No relevent experience', ..., 'STEM',
        '50-99', 'Pvt Ltd'],
       ['city_162', 'Male', 'Has relevent experience', ..., 'STEM',
        '50-99', 'Funded Startup'],
       ['city_46', 'Male', 'Has relevent experience', ..., 'STEM', '<10',
        'Pvt Ltd'],
       ...,
       ['city_160', 'Female', 'Has relevent experience', ..., 'STEM',
        '100-500', 'Public Sector'],
       ['city_103', 'Female', 'Has relevent experience', ...,
        'Humanities', '10/49', 'Funded Startup'],
       ['city_103', 'Male', 'Has relevent experience', ..., 'STEM',
        '50-99', 'Pvt Ltd']], dtype='<U32')

## b- Normalized Data

To revert this type of operation, the **mean** and **variance** of each feature must have been saved. Applying the inverse computations, multiplying by the variance and then adding the mean, restores the data.

In [61]:
for i in range(1, 4):
    final[:, i] *= stats[i-1, 1]
    final[:, i] += stats[i-1, 0]
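
That the inverse operations restore the original values can be checked on a small made-up vector, as in the sketch below (the values are illustrative). Because of floating-point rounding the comparison uses `np.allclose` rather than exact equality.

```python
import numpy as np

x = np.array([47.0, 8.0, 18.0, 46.0])   # made-up feature values

mean = x.mean()
var = x.var(ddof=1)                      # sample variance, matching the pandas default

normalized = (x - mean) / var            # forward transformation
restored = normalized * var + mean       # inverse: multiply by variance, add mean

print(np.allclose(restored, x))
```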

In [62]:
final[:,0:4]

array([[ 0.776, 47.   , 15.   ,  5.   ],
       [ 0.767,  8.   , 21.   ,  4.   ],
       [ 0.762, 18.   , 13.   ,  5.   ],
       ...,
       [ 0.92 , 23.   , 10.   ,  3.   ],
       [ 0.92 , 25.   ,  7.   ,  1.   ],
       [ 0.92 , 44.   , 21.   ,  4.   ]])

Lastly, the two numpy arrays can be merged in a data frame by,

In [74]:
back = pd.DataFrame(final[:,0:4], columns=columns[0:4])
back2 = pd.DataFrame(revert, columns=categoricals)
origin = pd.concat([back, back2],axis=1)

In [75]:
origin

| | city_development_index | training_hours | experience | last_new_job | city | gender | relevent_experience | enrolled_university | education_level | major_discipline | company_size | company_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.776 | 47.0 | 15.0 | 5.0 | city_40 | Male | No relevent experience | no_enrollment | Graduate | STEM | 50-99 | Pvt Ltd |
| 1 | 0.767 | 8.0 | 21.0 | 4.0 | city_162 | Male | Has relevent experience | no_enrollment | Masters | STEM | 50-99 | Funded Startup |
| 2 | 0.762 | 18.0 | 13.0 | 5.0 | city_46 | Male | Has relevent experience | no_enrollment | Graduate | STEM | <10 | Pvt Ltd |
| 3 | 0.920 | 46.0 | 7.0 | 1.0 | city_103 | Male | Has relevent experience | no_enrollment | Graduate | STEM | 50-99 | Pvt Ltd |
| 4 | 0.920 | 108.0 | 5.0 | 1.0 | city_103 | Male | Has relevent experience | no_enrollment | Graduate | STEM | 5000-9999 | Pvt Ltd |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8950 | 0.624 | 52.0 | 1.0 | 1.0 | city_21 | Male | No relevent experience | Full time course | Graduate | STEM | 100-500 | Pvt Ltd |
| 8951 | 0.920 | 36.0 | 9.0 | 1.0 | city_103 | Male | Has relevent experience | no_enrollment | Masters | STEM | 50-99 | Pvt Ltd |
| 8952 | 0.920 | 23.0 | 10.0 | 3.0 | city_160 | Female | Has relevent experience | no_enrollment | Graduate | STEM | 100-500 | Public Sector |
| 8953 | 0.920 | 25.0 | 7.0 | 1.0 | city_103 | Female | Has relevent experience | no_enrollment | Graduate | Humanities | 10/49 | Funded Startup |
