## Baselines

Compare any more sophisticated method against simple baselines, such as random guessing, all-positive, all-negative, and simple linear models, and show that it beats them. Evaluate every model with the metric the competition requires.
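
The cells in this notebook report plain accuracy, and the competition metric is not named here. Assuming the competition is PetFinder.my Adoption Prediction (an inference from the column list, not something this section states), submissions are scored with quadratic weighted kappa. A minimal NumPy sketch of that metric follows; the helper name `quadratic_weighted_kappa` is hypothetical and is not part of `baseline_models`.

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, num_classes=5):
    """Quadratic weighted kappa for integer labels in [0, num_classes)."""
    y_true = np.asarray(y_true, dtype=int).ravel()
    y_pred = np.asarray(y_pred, dtype=int).ravel()

    # Observed confusion matrix: rows are true labels, columns are predictions.
    observed = np.zeros((num_classes, num_classes))
    np.add.at(observed, (y_true, y_pred), 1)

    # Expected matrix if true and predicted labels were independent.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / len(y_true)

    # Quadratic penalty grows with the squared distance between classes.
    idx = np.arange(num_classes)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (num_classes - 1) ** 2

    denom = (weights * expected).sum()
    if denom == 0:
        # Every label and every prediction is the same class; kappa is undefined.
        return float("nan")
    return 1.0 - (weights * observed).sum() / denom
```

Under this metric a constant prediction scores 0 whatever its accuracy, so a baseline can look reasonable on accuracy and still be at chance level.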

In [1]:
import numpy as np
import pandas as pd
from glob import glob

import matplotlib.pyplot as plt

# Import our files
import data
import baseline_models

## Data Preprocessing

In [2]:
train_images = "data/train_images/*"
test_images = "data/test_images/*"

breed_labels_file = "data/breed_labels.csv"
state_labels_file = "data/state_labels.csv"
color_labels_file = "data/color_labels.csv"
test_file = "data/test.csv"
train_file = "data/train.csv"

In [3]:
# Load all data
train_df = pd.read_csv("data/train/train.csv")
test_df = pd.read_csv("data/test/test.csv")

# Label tables are used to interpret results; they are not model inputs
breed_labels_df = pd.read_csv(breed_labels_file)
state_labels_df = pd.read_csv(state_labels_file)
color_labels_df = pd.read_csv(color_labels_file)

In [4]:
# For linear regression just keep the numeric columns
y_col = 'AdoptionSpeed'

train_x_df = train_df[data.numeric_cols].copy()
train_y_df = train_df[[y_col]].copy()
test_x_df = test_df[data.numeric_cols].copy()

In [5]:
for col, num_class in data.one_hot_cols.items():
    # Compare strings by value ('is not' tests identity, not equality).
    # Breed1 and Breed2 are kept as they are rather than one-hot encoded.
    if col not in ('Breed1', 'Breed2'):
        data.one_hot_encode(train_x_df, col, num_class)
        data.one_hot_encode(test_x_df, col, num_class)
        print("One hot encoding {} with {} classes...".format(col, num_class))

One hot encoding Type with 2 classes...
One hot encoding Gender with 3 classes...
One hot encoding Color1 with 7 classes...
One hot encoding Color2 with 7 classes...
One hot encoding Color3 with 7 classes...
One hot encoding MaturitySize with 5 classes...
One hot encoding FurLength with 4 classes...
One hot encoding Vaccinated with 4 classes...
One hot encoding Dewormed with 4 classes...
One hot encoding Sterilized with 4 classes...
One hot encoding Health with 4 classes...
One hot encoding State with 15 classes...
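
The source of `data.one_hot_encode` is not shown in this section. As a rough sketch only, the function below is one way an in-place encoder could produce column names such as `Gender_1` or `State_41361`. It assumes that a column is created only for values that actually occur in the frame, and that the `num_class` argument can be ignored.

```python
import pandas as pd

def one_hot_encode(df, col, num_class=None):
    """Replace `col` in place with one 0/1 column per value found in it.

    `num_class` is accepted only to mirror the call above; this sketch ignores it.
    """
    for value in sorted(df[col].unique()):
        df["{}_{}".format(col, value)] = (df[col] == value).astype(int)
    df.drop(columns=[col], inplace=True)

# Tiny made-up frame for illustration only.
demo = pd.DataFrame({"Gender": [1, 2, 3, 1], "Age": [3, 1, 4, 2]})
one_hot_encode(demo, "Gender", 3)
print(list(demo.columns))
```

An encoder written this way gives train and test different columns whenever a value appears in only one of them, which is why the columns need to be reconciled afterwards.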


In [6]:
final_columns = list(set(list(train_x_df.columns.values) + list(test_x_df.columns)))
print(len(final_columns), final_columns)

64 ['Color1_1', 'Color3_3', 'Color1_3', 'Color1_6', 'Vaccinated_3', 'Color2_0', 'Color2_5', 'Vaccinated_1', 'Color3_7', 'Color2_4', 'State_41361', 'Dewormed_2', 'Health_2', 'Color2_2', 'Color1_5', 'Vaccinated_2', 'Color1_4', 'Color1_7', 'State_41401', 'Color3_5', 'State_41335', 'Sterilized_3', 'Gender_2', 'State_41367', 'Type', 'Age', 'PhotoAmt', 'Breed2', 'Gender_1', 'Quantity', 'Breed1', 'VideoAmt', 'State_41415', 'Color2_3', 'MaturitySize_3', 'State_41336', 'MaturitySize_4', 'FurLength_2', 'Color2_6', 'Color3_4', 'State_41325', 'State_41327', 'State_41345', 'Health_3', 'Fee', 'Sterilized_1', 'Color1_2', 'State_41332', 'Color3_0', 'MaturitySize_1', 'State_41326', 'FurLength_3', 'Gender_3', 'State_41330', 'Color2_7', 'Color3_6', 'State_41324', 'Dewormed_3', 'State_41342', 'Health_1', 'FurLength_1', 'MaturitySize_2', 'Dewormed_1', 'Sterilized_2']


In [7]:
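# Make sure that train and test have the same columns
for col in final_columns:
    if col not in train_x_df.columns:
        print("Adding column '{}' to train_x_df".format(col))
        train_x_df[col] = 0
    if col not in test_x_df.columns:
        print("Adding column '{}' to test_x_df".format(col))
        test_x_df[col] = 0

# Added columns land at the end, so put test in the same column order as train
test_x_df = test_x_df[train_x_df.columns]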

Adding column 'State_41415' to test_x_df
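
The same alignment can be done in one step with pandas' `DataFrame.reindex`, which also fixes the column order. The frames below are made-up stand-ins, not the competition data.

```python
import pandas as pd

train = pd.DataFrame({"Age": [1, 2], "State_41415": [0, 1], "Fee": [0, 50]})
test = pd.DataFrame({"Fee": [10, 0], "Age": [5, 7]})

# Give test exactly the training columns, in the training order;
# columns that test lacks are created and filled with 0.
test = test.reindex(columns=train.columns, fill_value=0)
print(list(test.columns))
```

Note that `reindex` also drops any column that exists only in `test`, which is usually what you want for a model fitted on the training columns.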


In [8]:
# Create a validation set from training set
msk = np.random.rand(len(train_x_df)) < 0.8
valid_x_df = train_x_df[~msk]
train_x_df = train_x_df[msk]
valid_y_df = train_y_df[~msk]
train_y_df = train_y_df[msk]

print("Train x shape: {}, Train y shape {}".format(train_x_df.shape, train_y_df.shape))
print("Valid x shape: {}, Valid y shape {}".format(valid_x_df.shape, valid_y_df.shape))
print("Test shape: {}".format(test_x_df.shape))

Train x shape: (12027, 64), Train y shape (12027, 1)
Valid x shape: (2966, 64), Valid y shape (2966, 1)
Test shape: (3948, 64)
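
The mask-based split above sends roughly 80% of the rows to training (12027 of 14993 here), but without a fixed seed the split changes on every run. The sketch below shows a seeded variant. The helper name `split_train_valid` and the seed value are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def split_train_valid(x_df, y_df, train_frac=0.8, seed=0):
    """Split rows into train/validation with a seeded random mask."""
    rng = np.random.default_rng(seed)
    # Each row goes to train independently with probability train_frac.
    mask = rng.random(len(x_df)) < train_frac
    return (x_df[mask], y_df[mask]), (x_df[~mask], y_df[~mask])

# Made-up data for illustration only.
x = pd.DataFrame({"Age": range(100)})
y = pd.DataFrame({"AdoptionSpeed": [i % 5 for i in range(100)]})
(train_x, train_y), (valid_x, valid_y) = split_train_valid(x, y)
print(train_x.shape, valid_x.shape)
```

Because the mask is drawn row by row, the split sizes are only approximately 80/20. A fixed seed makes them the same on every run.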


## Run random guess

In [9]:
random_guess_y = baseline_models.random_guess(train_x_df)

In [10]:
print(len(random_guess_y))
print(len(train_x_df))

12027
12027


In [11]:
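# NOTE: the training set is passed as both tuples, so the 'Validation'
# accuracy below is also computed on training rows
random_guess_pred = baseline_models.run_random_guess((train_x_df, train_y_df), (train_x_df, train_y_df), test_x_df)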

Running random guess...
'Training' accuracy: 0.2002993265153405
'Validation' accuracy: 0.19963415648125052


In [12]:
print(random_guess_pred)

[0 4 3 ... 0 4 2]
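
With five classes, a uniform random guess is right about one time in five whatever the label distribution, which matches the accuracies of roughly 0.20 printed above. The sketch below illustrates this on synthetic labels. It is not the code in `baseline_models`, and the helper names are hypothetical.

```python
import numpy as np

def random_guess(num_rows, num_classes=5, seed=0):
    """Draw one class label per row, uniformly from 0..num_classes-1."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, num_classes, size=num_rows)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels exactly."""
    return float(np.mean(np.asarray(y_true).ravel() == np.asarray(y_pred).ravel()))

# Synthetic, skewed labels for illustration only (not the competition data).
labels = np.array([0] * 100 + [1] * 900 + [2] * 2000 + [3] * 3000 + [4] * 4000)
preds = random_guess(len(labels))
print(accuracy(labels, preds))
```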


## Run all 'n' guesses

In [13]:
all_0_y = baseline_models.all_0(train_x_df)
print(all_0_y)
all_1_y = baseline_models.all_1(train_x_df)
print(all_1_y)
all_2_y = baseline_models.all_2(train_x_df)
print(all_2_y)
all_3_y = baseline_models.all_3(train_x_df)
print(all_3_y)
all_4_y = baseline_models.all_4(train_x_df)
print(all_4_y)

[0. 0. 0. ... 0. 0. 0.]
[1. 1. 1. ... 1. 1. 1.]
[2. 2. 2. ... 2. 2. 2.]
[3. 3. 3. ... 3. 3. 3.]
[4. 4. 4. ... 4. 4. 4.]


In [14]:
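# NOTE: the training set is passed as both tuples, which is why each
# pair of accuracies below is identical
all_pred = baseline_models.run_all_n_guess((train_x_df, train_y_df), (train_x_df, train_y_df), test_x_df)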

Running all '0' guess...
'Training' accuracy: 0.027188825143427287
'Validation' accuracy: 0.027188825143427287
Running all '1' guess...
'Training' accuracy: 0.20578697929658268
'Validation' accuracy: 0.20578697929658268
Running all '2' guess...
'Training' accuracy: 0.269227571297913
'Validation' accuracy: 0.269227571297913
Running all '3' guess...
'Training' accuracy: 0.21792633241872453
'Validation' accuracy: 0.21792633241872453
Running all '4' guess...
'Training' accuracy: 0.27987029184335244
'Validation' accuracy: 0.27987029184335244


In [15]:
print(all_pred)

[[0. 0. 0. ... 0. 0. 0.]
 [1. 1. 1. ... 1. 1. 1.]
 [2. 2. 2. ... 2. 2. 2.]
 [3. 3. 3. ... 3. 3. 3.]
 [4. 4. 4. ... 4. 4. 4.]]
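
The accuracy of an all-'n' guess is simply the fraction of rows whose label is n, so the five accuracies above sum to 1 and the best constant guess is the most frequent class ('4', at about 0.28). The sketch below computes all five values in one pass on synthetic labels. The helper name is hypothetical.

```python
import numpy as np

def constant_guess_accuracies(y_true, num_classes=5):
    """Accuracy of predicting class n for every row, for each n."""
    y_true = np.asarray(y_true, dtype=int).ravel()
    # Count how often each class occurs, then normalise to frequencies.
    counts = np.bincount(y_true, minlength=num_classes)
    return counts / len(y_true)

# Synthetic labels for illustration only (not the competition data).
labels = np.array([0, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4])
acc = constant_guess_accuracies(labels)
best = int(np.argmax(acc))
print(acc, best)
```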


## Run linear regression

In [16]:
test_pred = baseline_models.linear_regression((train_x_df, train_y_df), (valid_x_df, valid_y_df), test_x_df)

Training accuracy: 0.2631578947368421
Validation accuracy: 0.2663519892110587


In [17]:
print(test_pred)

[[2.]
 [3.]
 [3.]
 ...
 [2.]
 [2.]
 [3.]]
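
The predictions above are whole numbers, which suggests that the regression output is rounded to a class label before accuracy is computed. The source of `baseline_models.linear_regression` is not shown, so the following is only a sketch of that idea under stated assumptions: an ordinary least-squares fit with a bias term, then rounding and clipping to the valid range 0 to 4. The helper names are hypothetical.

```python
import numpy as np

def fit_linear(x, y):
    """Least-squares weights for y ~ [x, 1] (bias added as a last column)."""
    x1 = np.hstack([x, np.ones((len(x), 1))])
    weights, *_ = np.linalg.lstsq(x1, y, rcond=None)
    return weights

def predict_classes(x, weights, num_classes=5):
    """Round the continuous prediction and clip it to a valid class label."""
    x1 = np.hstack([x, np.ones((len(x), 1))])
    return np.clip(np.rint(x1 @ weights), 0, num_classes - 1)

# Synthetic data for illustration only: the label equals the feature value.
x = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
w = fit_linear(x, y)
preds = predict_classes(np.array([[1.2], [9.0]]), w)
print(preds)
```

Clipping matters because a linear model can predict values outside the label range, and those would otherwise round to classes that do not exist.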


## Columns
-    PetID - Unique hash ID of pet profile    
-    AdoptionSpeed - Categorical speed of adoption. Lower is faster. This is the value to predict. See the section below for more information.
-    Type - Type of animal (1 = Dog, 2 = Cat)
-    Name - Name of pet (Empty if not named)
-    Age - Age of pet when listed, in months
-    Breed1 - Primary breed of pet (Refer to BreedLabels dictionary)
-    Breed2 - Secondary breed of pet, if pet is of mixed breed (Refer to BreedLabels dictionary)
-    Gender - Gender of pet (1 = Male, 2 = Female, 3 = Mixed, if profile represents group of pets)
-    Color1 - Color 1 of pet (Refer to ColorLabels dictionary)
-    Color2 - Color 2 of pet (Refer to ColorLabels dictionary)
-    Color3 - Color 3 of pet (Refer to ColorLabels dictionary)
-    MaturitySize - Size at maturity (1 = Small, 2 = Medium, 3 = Large, 4 = Extra Large, 0 = Not Specified)
-    FurLength - Fur length (1 = Short, 2 = Medium, 3 = Long, 0 = Not Specified)
-    Vaccinated - Pet has been vaccinated (1 = Yes, 2 = No, 3 = Not Sure)
-    Dewormed - Pet has been dewormed (1 = Yes, 2 = No, 3 = Not Sure)
-    Sterilized - Pet has been spayed / neutered (1 = Yes, 2 = No, 3 = Not Sure)
-    Health - Health Condition (1 = Healthy, 2 = Minor Injury, 3 = Serious Injury, 0 = Not Specified)
-    Quantity - Number of pets represented in profile
-    Fee - Adoption fee (0 = Free)
-    State - State location in Malaysia (Refer to StateLabels dictionary)
-    RescuerID - Unique hash ID of rescuer
-    VideoAmt - Total uploaded videos for this pet
-    PhotoAmt - Total uploaded photos for this pet
-    Description - Profile write-up for this pet. The primary language used is English, with some in Malay or Chinese.
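
When inspecting results, it can help to turn the integer codes back into the labels listed above. The sketch below does this for a few columns, with the mappings copied from the list. The helper name `decode_columns` and the demo frame are hypothetical.

```python
import pandas as pd

# Code-to-label mappings copied from the column descriptions above.
TYPE_LABELS = {1: "Dog", 2: "Cat"}
GENDER_LABELS = {1: "Male", 2: "Female", 3: "Mixed"}
YES_NO_LABELS = {1: "Yes", 2: "No", 3: "Not Sure"}

def decode_columns(df):
    """Return a copy with coded columns replaced by readable labels."""
    out = df.copy()
    out["Type"] = out["Type"].map(TYPE_LABELS)
    out["Gender"] = out["Gender"].map(GENDER_LABELS)
    for col in ("Vaccinated", "Dewormed", "Sterilized"):
        out[col] = out[col].map(YES_NO_LABELS)
    return out

# Made-up rows for illustration only.
demo = pd.DataFrame({
    "Type": [1, 2], "Gender": [3, 1],
    "Vaccinated": [1, 3], "Dewormed": [2, 1], "Sterilized": [3, 2],
})
decoded = decode_columns(demo)
print(decoded)
```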