## Error Detection Challenge
### Overview:
Handwritten data is entered into an Access database, and at the end of each year every rescue record and field is reviewed against the handwritten logbook to verify that the database entry matches it. The objective of this challenge is to process the turtle rescue database and build a machine learning model that flags potential errors and anomalies, that is, a model that assigns each field a probability of having been entered erroneously from the logbook. `dirty_data.csv` contains the data as entered, including errors; `cleaned_data.csv` is the same data after the errors were corrected.

In [None]:
#importing necessary libraries
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import os
from sklearn.preprocessing import LabelBinarizer
%matplotlib inline

In [None]:
os.getcwd()

In [None]:
path = '../sea_turtle_challenge/data'

## Data preparation and exploration.
Here I use **latin-1** as the encoding to work around the
"can't decode" Unicode error raised by the usual
`pd.read_csv('<dataset>.csv')` call, which assumes UTF-8. The likely reason is that the files
are not UTF-8 encoded; they may have been exported from another format such as HTML rather than written as plain CSV.
The **cp1252** encoding would solve the issue as well.
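
To see why the encoding matters, here is a minimal standard-library sketch. The sample bytes are invented for illustration and are not taken from the challenge files: a single `é` stored as the byte `0xE9` is enough to make UTF-8 decoding fail, while latin-1 and cp1252 both accept it.

```python
# Invented sample: one CSV line whose "é" is stored as the single byte 0xE9.
raw = "Species,Note\nGreen turtle,caf\u00e9\n".encode("latin-1")


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


utf8_text = try_decode(raw, "utf-8")      # fails: 0xE9 is not a valid UTF-8 sequence here
latin1_text = try_decode(raw, "latin-1")  # succeeds: every byte maps to a character
cp1252_text = try_decode(raw, "cp1252")   # succeeds: 0xE9 is also "é" in cp1252

print(utf8_text is None, latin1_text == cp1252_text)
```

Note that latin-1 never raises, because all 256 byte values are valid in it. That makes it a convenient fallback, but it can silently produce wrong characters if the file was actually written in another encoding.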

In [None]:
dirty_data = pd.read_csv(path + '/dirty_data.csv', encoding='latin-1')
clean_data = pd.read_csv(path + '/cleaned_data.csv', encoding='latin-1')
test_data = pd.read_csv(path + '/test_data.csv', encoding='latin-1')

In [None]:
# previewing the first few dirty data records
dirty_data.head()

In [None]:
# previewing the first few cleaned data records
clean_data.head()

In [None]:
#filling NaN values with 0 so that missing entries compare as equal in the dirty and clean data
dirty_data.fillna(0, inplace=True)
clean_data.fillna(0, inplace=True)
test_data.fillna(0, inplace=True)

### Target variables
Since the model learns from the dirty data with the cleaned data as its reference, we generate the target variables *(an error flag for each column)* by comparing the clean and dirty dataframes element by element.
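
The comparison can be sketched on a pair of toy frames. This is a minimal illustration, assuming both frames have the same shape and row order; the `Species` and `Weight_Kg` columns and all values are invented stand-ins, not taken from the challenge data.

```python
import pandas as pd

# Toy stand-ins for dirty_data / clean_data (column names and values are invented).
dirty = pd.DataFrame({"Rescue_ID": ["r1", "r2", "r3"],
                      "Species": ["green", "hawksbil", "green"],
                      "Weight_Kg": [12.0, 30.5, 0.0]})
clean = pd.DataFrame({"Rescue_ID": ["r1", "r2", "r3"],
                      "Species": ["green", "hawksbill", "green"],
                      "Weight_Kg": [12.0, 30.5, 8.0]})

# 1 where the dirty entry differs from the cleaned entry, 0 where they match.
mask = (dirty.values != clean.values).astype(int)
toy_targets = pd.DataFrame(mask, columns=dirty.columns, index=dirty.index)

# The ID column never carries an error flag of interest, so it is dropped.
toy_targets = toy_targets.drop(columns="Rescue_ID")
print(toy_targets)
```

Building the mask directly from the inequality keeps the flag separate from the data, so a field whose genuine value happens to be 1 is not mistaken for an error marker.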

In [None]:
#generating targets values (errors)
targets = dirty_data.where(dirty_data.values==clean_data.values)
targets.head()

In [None]:
#filling the new NaN targets to represent 1 (error)
targets.fillna(1, inplace=True)

#dropping the rescue_ID column as it tells nothing about the error
targets.drop('Rescue_ID', axis=1, inplace=True)

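*Recomputing the targets as a 0/1 mask: 1 where the dirty and cleaned entries differ (error), 0 where they match (no error)*

In [None]:
#comparing the frames directly avoids confusing a genuine value of 1 with the error flag
targets = pd.DataFrame((dirty_data.values != clean_data.values).astype(int),
                       columns=dirty_data.columns, index=dirty_data.index)
targets.drop('Rescue_ID', axis=1, inplace=True)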

In [None]:
targets.head()

In [None]:
dirty_data.info()

In [None]:
dirty_data["Date_Caught"] = pd.to_datetime(dirty_data.Date_Caught)
test_data["Date_Caught"] = pd.to_datetime(test_data.Date_Caught)

In [None]:
dirty_data["year"] = dirty_data["Date_Caught"].dt.year
test_data["year"] = test_data["Date_Caught"].dt.year
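
`pd.to_datetime` raises an error on values it cannot parse, which may happen if the dirty data contains mistyped dates. Whether the challenge files contain such values is an assumption; the sketch below uses invented strings to show how `errors="coerce"` turns unparseable entries into `NaT` instead of failing.

```python
import pandas as pd

# Invented sample: the middle entry is not a valid date.
raw_dates = pd.Series(["2015-03-01", "not a date", "2016-11-20"])

# errors="coerce" replaces unparseable values with NaT instead of raising.
parsed = pd.to_datetime(raw_dates, errors="coerce")

# The year of a NaT entry is NaN, so the result is a float column.
years = parsed.dt.year
print(years)
```

The missing years then need their own handling (for example a sentinel value) before the column is fed to a model.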

In [None]:
#dropping the ID and raw date columns from the train and test data (the year has already been extracted)
test_data.drop(["Rescue_ID", "Date_Caught"], axis=1, inplace=True)
dirty_data.drop(["Rescue_ID", "Date_Caught"], axis=1, inplace=True)

#training features and labels
features = dirty_data
labels = targets

In [None]:
labels = labels.rename(columns={"Date_Caught":"year"})

In [None]:
features.head()

In [None]:
labels.head()

In [None]:
""" applying transformations to the dirty and test data i.e. text to int conversion """
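from sklearn.feature_extraction.text import CountVectorizer

#CountVectorizer expects an iterable of strings; passing the DataFrame itself
#would iterate over its column names, so each row is joined into one document
docs = features.astype(str).agg(' '.join, axis=1)

vect = CountVectorizer()
#learning the vocabulary of the training data
vect.fit(docs)

In [None]:
#examining fitted vocabulary (get_feature_names() was removed in scikit-learn 1.2)
vect.get_feature_names_out()

In [None]:
#transforming training data into a document-term matrix
feats = vect.transform(docs)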

In [None]:
#converting the sparse-matrix to a dense-matrix
feats.toarray()

In [None]:
#examining the vocabulary and document-term matrix together
new_features = pd.DataFrame(feats.toarray(), columns = vect.get_feature_names_out())
new_features.head()
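
A document-term matrix has one row per document and one column per vocabulary token. The following standard-library sketch builds one by hand, treating each record as a single space-joined string. The sample records are invented, and the code only mimics the idea behind `CountVectorizer`; it does not reproduce its tokenization rules.

```python
from collections import Counter

# Each record is one text "document" (values are invented).
docs = ["green beach 2015", "hawksbill beach 2016", "green creek 2015"]

# Vocabulary: the sorted set of unique tokens across all documents.
vocab = sorted({token for doc in docs for token in doc.split()})

# Document-term matrix: count of each vocabulary token in each document.
matrix = [[Counter(doc.split())[token] for token in vocab] for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```

Each column of the matrix corresponds to one token, so a rare or misspelled value shows up as a column that is non-zero for only a few rows.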

In [None]:
#using labelpowerset
from skmultilearn.problem_transform import LabelPowerset
from sklearn.naive_bayes import GaussianNB

classifier = LabelPowerset(GaussianNB())
#train the model
classifier.fit(new_features, labels)

In [None]:
#making predictions with the model
preds = classifier.predict(new_features)

In [None]:
#model evaluation
from sklearn.metrics import accuracy_score, mean_absolute_error, confusion_matrix
accuracy_score(labels, preds)
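
For multi-label targets, `accuracy_score` reports subset accuracy: a row counts as correct only if every field's flag matches. The plain-Python sketch below, on invented labels, contrasts that with the share of individual fields predicted correctly.

```python
# Invented labels: three records, three error flags each.
y_true = [[0, 1, 0], [0, 0, 0], [1, 0, 0]]
y_pred = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]

# Subset accuracy: the whole row must match exactly.
subset_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Per-field accuracy: each individual flag is scored on its own.
total_cells = sum(len(row) for row in y_true)
matching_cells = sum(tv == pv
                     for t, p in zip(y_true, y_pred)
                     for tv, pv in zip(t, p))
per_field_accuracy = matching_cells / total_cells

print(subset_accuracy, per_field_accuracy)
```

Since the predictions here are made on the same rows the model was trained on, either score is an optimistic estimate; a held-out split would give a fairer picture.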

In [None]:
#predictions
preds = classifier.predict(new_features)

In [None]:
accuracy_score(labels, preds)

### Training the model

In [None]:
#importing the model and scoring metrics from the sklearn library
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

In [None]:
#'mae' is only a regression criterion; classifiers accept 'gini', 'entropy' or 'log_loss'
model = RandomForestClassifier(n_estimators=100, criterion='gini', n_jobs=-1, random_state=42)

In [None]:
model.fit(new_features, labels)
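
The challenge asks for a probability per field rather than a hard 0/1 prediction. A hedged sketch of how that could be read from a multi-output random forest follows; the toy features and targets are randomly generated, and `prob_of_class_one` is a hypothetical helper introduced only for this illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Randomly generated toy data: 40 records, 5 binary features, 2 error flags.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(40, 5))
Y = np.column_stack([X[:, 0], X[:, 1] ^ X[:, 2]])

forest = RandomForestClassifier(n_estimators=20, random_state=0)
forest.fit(X, Y)


def prob_of_class_one(proba, classes):
    """Hypothetical helper: column of P(flag == 1), or zeros if 1 was never seen."""
    classes = list(classes)
    if 1 in classes:
        return proba[:, classes.index(1)]
    return np.zeros(proba.shape[0])


# With multi-output targets, predict_proba returns one array per target column.
proba_per_field = forest.predict_proba(X)
error_proba = np.column_stack([
    prob_of_class_one(p, c) for p, c in zip(proba_per_field, forest.classes_)
])
print(error_proba.shape)
```

The helper guards against a field that has no errors at all in the training data, in which case the forest only knows class 0 and returns a single probability column for that field.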

In [None]:
""" checking the shape of our test data """

test_data.shape

In [None]:
test_data.head()