# PetFinder.my Adoption Prediction

COMP-3125 Data Science  
Professor Ergezer   
Mengting Wang, Yen Le  
12/15/2020

## Table of Contents
- [Data Exploration](#Data-Exploration)
    - [Read Data](#Read-Data)
    - [Data Fields](#Data-Fields)
    - [Data Visualization](#Data-Visualization)
- [Machine Learning Models](#Machine-Learning-Models)
    - [Feature Engineering](#Feature-Engineering)
- [Conclusion](#Conclusion)

## Data Exploration

In [None]:
# required packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn.metrics as metrics
from sklearn.model_selection import train_test_split 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB 
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier

### Read Data

In [None]:
# read data
train = pd.read_csv(r'C:\Users\wangm1\Desktop\DS Final Project\train\train.csv')
test = pd.read_csv(r'C:\Users\wangm1\Desktop\DS Final Project\test\test.csv')
breed = pd.read_csv(r'C:\Users\wangm1\Desktop\DS Final Project\breed_labels.csv')
color = pd.read_csv(r'C:\Users\wangm1\Desktop\DS Final Project\color_labels.csv')
state = pd.read_csv(r'C:\Users\wangm1\Desktop\DS Final Project\state_labels.csv')

# display the training data
# (replace with breed, color, or state to inspect the label tables)
train

In [None]:
# get a summary of data
train.describe()

### Data Fields
Although the dataset is clean and easy to understand, some of the column headers are not intuitive.  
Here's what they mean:

1. PetID - Unique hash ID of pet profile
2. AdoptionSpeed - Categorical speed of adoption. Lower is faster. This is the value to predict. 
3. Type - Type of animal (1 = Dog, 2 = Cat)
4. Name - Name of pet (Empty if not named)
5. Age - Age of pet when listed, in months
6. Breed1 - Primary breed of pet (Refer to BreedLabels dictionary)
7. Breed2 - Secondary breed of pet, if pet is of mixed breed (Refer to BreedLabels dictionary)
8. Gender - Gender of pet (1 = Male, 2 = Female, 3 = Mixed, if profile represents group of pets)
9. Color1 - Color 1 of pet (Refer to ColorLabels dictionary)
10. Color2 - Color 2 of pet (Refer to ColorLabels dictionary)
11. Color3 - Color 3 of pet (Refer to ColorLabels dictionary)
12. MaturitySize - Size at maturity (1 = Small, 2 = Medium, 3 = Large, 4 = Extra Large, 0 = Not Specified)
13. FurLength - Fur length (1 = Short, 2 = Medium, 3 = Long, 0 = Not Specified)
14. Vaccinated - Pet has been vaccinated (1 = Yes, 2 = No, 3 = Not Sure)
15. Dewormed - Pet has been dewormed (1 = Yes, 2 = No, 3 = Not Sure)
16. Sterilized - Pet has been spayed / neutered (1 = Yes, 2 = No, 3 = Not Sure)
17. Health - Health Condition (1 = Healthy, 2 = Minor Injury, 3 = Serious Injury, 0 = Not Specified)
18. Quantity - Number of pets represented in profile
19. Fee - Adoption fee (0 = Free)
20. State - State location in Malaysia (Refer to StateLabels dictionary)
21. RescuerID - Unique hash ID of rescuer
22. VideoAmt - Total uploaded videos for this pet
23. PhotoAmt - Total uploaded photos for this pet
24. Description - Profile write-up for this pet. The primary language used is English, with some in Malay or Chinese.

#### Adoption Speed
The value is determined by how quickly, if at all, a pet is adopted. The values are determined in the following way:

0 - Pet was adopted on the same day as it was listed.  
1 - Pet was adopted between 1 and 7 days (1st week) after being listed.  
2 - Pet was adopted between 8 and 30 days (1st month) after being listed.  
3 - Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed.  
4 - No adoption after 100 days of being listed. (There are no pets in this dataset that waited between 90 and 100 days).
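
The category boundaries above can be written as a small function. This is only a sketch for illustration: the function name `adoption_speed` and the convention that `None` means "never adopted" are assumptions, and because the dataset has no pets that waited between 90 and 100 days, the sketch simply treats anything over 90 days as category 4.

```python
from typing import Optional


def adoption_speed(days_to_adoption: Optional[int]) -> int:
    """Map days between listing and adoption to an AdoptionSpeed category.

    None stands for a pet that was never adopted (assumed convention).
    """
    if days_to_adoption is None:
        return 4  # no adoption recorded
    if days_to_adoption < 0:
        raise ValueError("days_to_adoption cannot be negative")
    if days_to_adoption == 0:
        return 0  # adopted on the day of listing
    if days_to_adoption <= 7:
        return 1  # first week
    if days_to_adoption <= 30:
        return 2  # first month
    if days_to_adoption <= 90:
        return 3  # second and third month
    return 4      # assumption: anything beyond 90 days counts as not adopted


for days in (0, 5, 20, 60, None):
    print(days, "->", adoption_speed(days))
```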

### Data Visualization


In [None]:
plt.title('Species', fontsize='xx-large')
train['Type'].value_counts().rename({1:'Dogs',2:'Cats'}).plot(kind='barh')
plt.xlabel('Count')

In [None]:
plt.title('Adoption Speed', fontsize='xx-large')
train['AdoptionSpeed'].value_counts().rename(
    {0:'Same Day',
     1:'1-7 Days',
     2:'8-30 Days',
     3:'31-90 Days',
     4:'No adoption after 100 Days'}).plot(kind='barh')
plt.xlabel('Count')

## Machine Learning Models

### Feature Engineering

Some features are not very relevant for predicting adoption speed, such as a pet's name, RescuerID, PetID, and description. In addition, most of the values in Breed2 and Color3 are 0 (unknown), so they provide little information for predicting adoption speed.  
Thus, we can remove these features.
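
One way to check the claim about Breed2 and Color3 is to measure the share of rows where each column equals 0. The sketch below uses a small made-up DataFrame (`toy`) rather than the real training data, so the numbers it prints are illustrative only.

```python
import pandas as pd

# Toy stand-in for the training data; the values are made up for illustration.
toy = pd.DataFrame({
    "Breed1": [307, 266, 307, 265],
    "Breed2": [0, 0, 307, 0],
    "Color3": [0, 0, 0, 7],
})

# Fraction of rows in which each column is 0.
# (toy == 0) gives a boolean frame; the column mean is the share of True values.
zero_share = (toy == 0).mean()
print(zero_share)
```

On the notebook's data, the same expression would be applied to the real frame, for example `(train[['Breed2', 'Color3']] == 0).mean()`, before the columns are dropped.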

In [None]:
drop = ['Name', 'Breed2','Color3', 'RescuerID', 'PetID', 'Description']
train = train.drop(drop, axis = 1)
# test  = test.drop(drop, axis = 1)

X = train.drop(['AdoptionSpeed'], axis = 1)
y = train.AdoptionSpeed

# train-test split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.25, random_state=0)

# confirm split
print('Training Features Shape:', Xtrain.shape)
print('Training Labels Shape:', ytrain.shape)
print('Testing Features Shape:', Xtest.shape)
print('Testing Labels Shape:', ytest.shape)

### Approach One - KNeighborsClassifier

In [None]:
model = KNeighborsClassifier(n_neighbors=1)
model.fit(Xtrain, ytrain)

# Test it on test data
y_model = model.predict(Xtest)

# accuracy score
accuracy_score(ytest, y_model)

In [None]:
# Adoption speed distribution
ytest.value_counts().plot.bar()

In [None]:
mat = confusion_matrix(ytest, y_model)

sns.heatmap(mat, square=True, annot=True, cbar=False, cmap='YlGnBu') #flag, YlGnBu, jet
plt.xlabel('predicted value')
plt.ylabel('true value');

### Approach Two - Naive Bayes

In [None]:
model = GaussianNB()                       # instantiate model
model.fit(Xtrain, ytrain)                  # fit model to data
y_model = model.predict(Xtest)             # apply model to the test data

# accuracy score
accuracy_score(ytest, y_model)

In [None]:
from sklearn.metrics import cohen_kappa_score
cohen_kappa_score(ytest, y_model, weights = "quadratic")
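
To make the quadratic weighting used in the cell above more concrete, here is a hand-rolled sketch of the metric. It is not the scikit-learn implementation; the function name and the fixed `n_classes=5` default are assumptions for this example. Disagreements are penalized by the squared distance between the true and predicted category, so predicting 4 for a true 0 costs far more than predicting 1. It should agree with `cohen_kappa_score(..., weights="quadratic")` when all five classes are present in the data.

```python
import numpy as np


def quadratic_weighted_kappa(y_true, y_pred, n_classes=5):
    """Quadratic weighted kappa for integer labels in 0..n_classes-1."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    # Observed confusion matrix: rows are true labels, columns are predictions.
    observed = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        observed[t, p] += 1

    # Quadratic penalty: 0 on the diagonal, 1 for the largest possible miss.
    idx = np.arange(n_classes)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2

    # Confusion matrix expected if predictions were independent of the truth.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

    return 1.0 - (weights * observed).sum() / (weights * expected).sum()


perfect = quadratic_weighted_kappa([0, 1, 2, 3, 4], [0, 1, 2, 3, 4])
worst = quadratic_weighted_kappa([0, 0, 4, 4], [4, 4, 0, 0])
print(perfect, worst)
```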

### Approach Three - Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver='lbfgs', multi_class='auto',max_iter = 15000)
model.fit(Xtrain,ytrain)

# Make predictions on test data
y_model = model.predict(Xtest)

# accuracy score
accuracy_logistic = accuracy_score(ytest, y_model)
print(accuracy_logistic)


### Approach Four - Random Forest Classifier

In [None]:
model = RandomForestClassifier(n_estimators=150)
model.fit(Xtrain,ytrain)

# Make predictions on test data
y_model = model.predict(Xtest)

# accuracy score
accuracy_score(ytest, y_model)

### Approach Five - Random Forest Regressor

In [None]:
model = RandomForestRegressor(n_jobs=-1)

# Try different numbers of n_estimators
estimators = np.arange(10, 200, 10)
scores = []
for n in estimators:
    model.set_params(n_estimators=n)
    model.fit(Xtrain, ytrain)
    scores.append(model.score(Xtest, ytest))
plt.title("Effect of n_estimators")
plt.xlabel("n_estimators")
plt.ylabel("R^2 score on test split")  # score() returns R^2 for a regressor
plt.plot(estimators, scores)

Based on the graph, we chose 150 as the number of estimators.
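
The same choice can also be read off programmatically instead of from the plot. The sketch below uses a short, made-up list of scores (not the notebook's actual results) to show the idea; in the notebook, `estimators` and `scores` would come from the loop above.

```python
import numpy as np

# Hypothetical values for illustration only; not results from this notebook.
estimators = np.array([10, 50, 100, 150, 190])
scores = np.array([0.050, 0.090, 0.100, 0.110, 0.108])

# Pick the n_estimators value with the highest score.
best_n = int(estimators[np.argmax(scores)])
print("best n_estimators:", best_n)
```

Because the forest is fit without a fixed `random_state`, the scores can shift slightly between runs, so small differences between neighboring values should not be over-interpreted.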

In [None]:
model = RandomForestRegressor(n_estimators=150)
model.fit(Xtrain,ytrain)

# Make predictions on test data
y_model = model.predict(Xtest)

# regression error metrics
print('Mean Absolute Error (MAE):', metrics.mean_absolute_error(ytest, y_model))
print('Mean Squared Error (MSE):', metrics.mean_squared_error(ytest, y_model))
print('Root Mean Squared Error (RMSE):', np.sqrt(metrics.mean_squared_error(ytest, y_model)))


## Conclusion