# Modeling & Evaluation (v1)

> Here, we create **models** to make predictions and **evaluate** how well those models learned.

---

# "Training" class
Here, we will create a **Training class** that will be used for training models in the future.

In [1]:
from datetime import datetime

import pandas as pd
from sklearn.model_selection import train_test_split
from catboost import CatBoostRegressor


class Training:

    def split_data(self, X, y):
        """Split X and y into 70% training and 30% validation sets."""
        start_time = datetime.now()

        # Use the method arguments (the original read a global X by accident).
        X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=42)
        print("Data splitted!")

        end_time = datetime.now()
        print('Method runtime: {}'.format(end_time - start_time))

        return X_train, X_valid, y_train, y_valid

    def catBoostRegressor(self, X_train, y_train, X_valid_or_test, cat_features=None, predict=False):
        """Fit a CatBoostRegressor; return predictions if predict is True, else the fitted model."""
        start_time = datetime.now()

        # Avoid a mutable default argument; fall back to the first column as before.
        if cat_features is None:
            cat_features = [0]

        model = CatBoostRegressor()
        model.fit(X_train, y_train, cat_features=cat_features)

        result = model
        if predict:
            # Wrap the predictions in a DataFrame with the target column name.
            result = pd.DataFrame(model.predict(X_valid_or_test), columns=["SalaryNormalized"])

        end_time = datetime.now()
        print('Method runtime: {}'.format(end_time - start_time))
        return result

In [2]:
# Training instance.
training = Training()
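
Both methods of the Training class repeat the same start/end timing code. As a minimal sketch, that pattern could be moved into a decorator. The names `timed` and `add` below are hypothetical and not part of the original notebook; `add` only stands in for a method such as `split_data`.

```python
import functools
from datetime import datetime


def timed(func):
    """Print how long the wrapped call took, using the notebook's message format."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start_time = datetime.now()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs even if func raises, so the runtime is always reported.
            print("Method runtime: {}".format(datetime.now() - start_time))
    return wrapper


@timed
def add(a, b):  # hypothetical stand-in for split_data or catBoostRegressor
    return a + b


result = add(2, 3)
```

With this approach, each method body would contain only its own logic, and the runtime message would be printed on every code path.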

---

# Import Preprocessing class
We also have a Preprocessing class to work with the data. Let's use it here:

In [3]:
%run "../src/preprocessing.py"

In [4]:
preprocessing = Preprocessing()

# Getting Training and Testing sets

In [5]:
# Extract training set.
preprocessing.extract_7z_data("../datasets/Train_rev1.7z")

File extracted!
Method runtime: 0:00:05.883333


In [6]:
df_training = preprocessing.get_training_data()

Training data is ready!
Method runtime: 0:00:04.814837


---

# Getting Independent and Dependent (target) variables to train the model

In [7]:
X = df_training.drop(columns=[
    'Id',
    'Title',
    'FullDescription',
    'LocationRaw',
    'LocationNormalized',
    'Company',
    'Category',
    'SalaryRaw',
    'SalaryNormalized',
    'SourceName'
]).fillna(0)
X

Unnamed: 0,ContractType,ContractTime
0,0,permanent
1,0,permanent
2,0,permanent
3,0,permanent
4,0,permanent
...,...,...
244763,0,contract
244764,0,contract
244765,0,contract
244766,0,contract


In [8]:
y = df_training["SalaryNormalized"]
y

0         25000
1         30000
2         30000
3         27500
4         25000
          ...  
244763    22800
244764    22800
244765    22800
244766    22800
244767    42500
Name: SalaryNormalized, Length: 244768, dtype: int64

---

# Split data into Training and Validation

In [9]:
X_train, X_valid, y_train, y_valid = training.split_data(X, y)

Data splitted!
Method runtime: 0:00:00.031475
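
The sizes of the two sets follow from `test_size=0.3`. The sketch below reproduces the arithmetic, assuming (as scikit-learn does to my knowledge) that a fractional `test_size` is rounded up and the training set receives the remaining rows.

```python
import math

n_samples = 244768   # rows in df_training, as reported in the output of y
test_size = 0.3      # value passed to train_test_split in split_data

# Assumption: the validation share is rounded up, the rest goes to training.
n_valid = math.ceil(test_size * n_samples)
n_train = n_samples - n_valid

print(n_valid, n_train)
```

These values can be checked against the `Length` reported for `y_train` and `y_valid` in this notebook.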


In [10]:
X_train

Unnamed: 0,ContractType,ContractTime
241453,0,permanent
169520,full_time,contract
136661,full_time,0
125435,0,contract
192568,0,permanent
...,...,...
119879,0,permanent
103694,0,permanent
131932,0,0
146867,0,0


In [11]:
X_valid

Unnamed: 0,ContractType,ContractTime
30390,0,permanent
108709,full_time,0
13924,full_time,permanent
154606,full_time,permanent
172891,0,permanent
...,...,...
196783,0,permanent
157358,0,permanent
73696,0,permanent
142365,part_time,0


In [12]:
y_train

241453    19000
169520    29172
136661    23000
125435    13440
192568    48000
          ...  
119879    79000
103694    19000
131932    52000
146867    30000
121958    12950
Name: SalaryNormalized, Length: 171337, dtype: int64

In [13]:
y_valid

30390     23500
108709    23040
13924     34850
154606    22500
172891    55000
          ...  
196783    51000
157358    27000
73696     15500
142365    27819
105535    21500
Name: SalaryNormalized, Length: 73431, dtype: int64

---

# Training the model

In [15]:
from catboost import Pool

# Encapsulate training data.
pool_train = Pool(
    X_train,
    y_train,
    cat_features = ['ContractType', 'ContractTime'],
)

# Encapsulate validation data.
pool_valid = Pool(
    X_valid,
    y_valid,
    cat_features = ['ContractType', 'ContractTime'],
)

In [16]:
from catboost import CatBoostRegressor

model = CatBoostRegressor()

model.fit(
    pool_train,
    eval_set=pool_valid,
    silent=True,
)

<catboost.core.CatBoostRegressor at 0x1ee9e853370>

# Making predictions
Now, let's make some predictions:

In [17]:
salaries_predicted = model.predict(X_valid)

In [18]:
salaries_predicted.shape

(73431,)

In [19]:
salaries_predicted

array([35356.20052772, 24519.87864313, 35898.96578245, ...,
       35356.20052772, 18477.37896706, 35356.20052772])

# Comparing predictions with "y_valid"

In [20]:
df_salaries_predicted = pd.DataFrame(salaries_predicted)
df_salaries_predicted.columns = ['salary_predicted']
df_salaries_predicted.describe()

Unnamed: 0,salary_predicted
count,73431.0
mean,34160.545258
std,4154.619585
min,18477.378967
25%,35356.200528
50%,35356.200528
75%,35767.796688
max,36596.411695


In [21]:
df_salaries_predicted.mode()

Unnamed: 0,salary_predicted
0,35356.200528


In [22]:
df_y_valid = pd.DataFrame(y_valid)
df_y_valid.columns = ['y_valid']
df_y_valid.describe()

Unnamed: 0,y_valid
count,73431.0
mean,34070.297531
std,17589.390641
min,5000.0
25%,21500.0
50%,30000.0
75%,42500.0
max,200000.0


In [23]:
df_y_valid.mode()

Unnamed: 0,y_valid
0,35000


 - **Min value:**
   - Salary predicted: 18,477
   - y_valid: 5,000
 - **Max value:**
   - Salary predicted: 36,596
   - y_valid: 200,000
 - **Mean:**
   - Salary predicted: 34,160
   - y_valid: 34,070
 - **Median:**
   - Salary predicted: 35,356
   - y_valid: 30,000
 - **Mode:**
   - Salary predicted: 35,356
   - y_valid: 35,000
 - **Standard Deviation:**
   - Salary predicted: 4,154
   - y_valid: 17,589
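
One likely reason the predictions are so much less spread out than `y_valid` is that the model only sees two categorical features, and rows with identical features always receive the same prediction. The sketch below counts the possible feature combinations, assuming the levels visible in the `X_train` and `X_valid` outputs above are the only ones in the data.

```python
from itertools import product

# Levels visible in the X_train / X_valid outputs; 0 is the fillna placeholder.
# Assumption: no other levels occur in the full dataset.
contract_type_levels = [0, "full_time", "part_time"]
contract_time_levels = [0, "permanent", "contract"]

# Every row of X must be one of these (ContractType, ContractTime) pairs.
combinations = list(product(contract_type_levels, contract_time_levels))

print(len(combinations))
```

Under that assumption the model can output only a handful of distinct salaries, which would be consistent with the 25% quantile, median, and mode of the predictions all being the same value.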

---

# Evaluating the model

> Finally, let's **evaluate the model**.

The **Evaluation Metric** is **[MAE](https://en.wikipedia.org/wiki/Mean_absolute_error)**.

In [24]:
from sklearn.metrics import mean_absolute_error

In [25]:
mae = mean_absolute_error(y_valid, salaries_predicted)
mae

12877.813536658401
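
To make the metric concrete, here is a minimal sketch of MAE computed by hand, together with a constant baseline that always predicts the median. The function name `mean_absolute_error_manual` and the toy values are illustrative only and are not taken from the dataset.

```python
from statistics import median


def mean_absolute_error_manual(y_true, y_pred):
    """Average of |actual - predicted| over all rows."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values for illustration only.
y_true = [20000, 30000, 40000, 90000]
y_pred = [25000, 25000, 35000, 35000]

model_mae = mean_absolute_error_manual(y_true, y_pred)

# Constant baseline: always predict the median of the targets.
# In the notebook this would be the median of y_train, evaluated on y_valid.
baseline = median(y_true)
baseline_mae = mean_absolute_error_manual(y_true, [baseline] * len(y_true))

print(model_mae, baseline_mae)
```

Comparing the model's MAE with such a constant baseline is one way to judge how much the two features actually contribute.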

---

# Modeling & Evaluation (v1) - Summary

 - **In this model, we use the features:**
   - **Independent variables:**
     - ContractType
     - ContractTime
   - **Dependent variable:**
     - SalaryNormalized
 - **Preprocessing:**
   - We only apply `fillna(0)` to missing data.
     - ContractType had 73% missing data.
     - ContractTime had 26% missing data.
   - **NOTE:**
     - There is a lot of missing data, but the focus for now is creating a baseline model (baseline, dummy, PoC, prototype).
     - That is, the simplest model possible.
 - **Comparison between predicted data and validation data (y_valid):**
   - **Min value:**
     - Salary predicted: 18,477
     - y_valid: 5,000
   - **Max value:**
     - Salary predicted: 36,596
     - y_valid: 200,000
   - **Mean:**
     - Salary predicted: 34,160
     - y_valid: 34,070
   - **Median:**
     - Salary predicted: 35,356
     - y_valid: 30,000
   - **Mode:**
     - Salary predicted: 35,356
     - y_valid: 35,000
   - **Standard Deviation:**
     - Salary predicted: 4,154
     - y_valid: 17,589
 - **The result of the Evaluation Metric (MAE) was:**
   - 12877.813536658401
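
The missing-data percentages in the preprocessing notes can be measured directly with pandas. The sketch below uses a toy frame (values not taken from the dataset) and also shows a string placeholder as a possible alternative to filling with `0`, which keeps each categorical column a single type. The placeholder name `"missing"` is an assumption for illustration.

```python
import pandas as pd

# Toy frame for illustration only; values are not taken from the dataset.
df = pd.DataFrame({
    "ContractType": [None, "full_time", None, "part_time"],
    "ContractTime": ["permanent", None, "contract", "permanent"],
})

# Fraction of missing values per column (multiply by 100 for a percentage).
missing_share = df.isna().mean()

# Alternative to fillna(0): an explicit string category for missing values.
df_filled = df.fillna("missing")

print(missing_share)
print(df_filled)
```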

---

Ro**drigo** **L**eite da **S**ilva - **drigols**