### Summary
This exercise uses a binary classification data set, Food_Classifier.csv, to predict whether an expense is food or non-food.
Since the data set is labelled, supervised learning is used: a model is trained on the labelled examples and then predicts labels for unseen ones.

Text classification is performed using the tf-idf (term frequency-inverse document frequency) approach, which weights each term by how often it occurs in a document and how rare it is across the whole collection of documents.
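As a minimal sketch of the weighting idea, the toy example below fits scikit-learn's `TfidfVectorizer` on three made-up expense descriptions (hypothetical text, not rows from Food_Classifier.csv). A term that occurs in fewer documents should receive a larger weight than a term that occurs in many.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical expense descriptions, not taken from the data set)
docs = [
    "lunch client site",
    "parking airport",
    "lunch airport",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

vocab = vectorizer.vocabulary_          # maps each term to its column index
first_doc = tfidf[0].toarray().ravel()  # weights for "lunch client site"

# "lunch" appears in two documents and "client" in only one, so within the
# first document "client" is expected to get the larger weight.
lunch_weight = first_doc[vocab["lunch"]]
client_weight = first_doc[vocab["client"]]

print(tfidf.shape)
print(round(lunch_weight, 3), round(client_weight, 3))
```

Each row of the result describes one document and each column one vocabulary term, which is the layout the classifiers later in this notebook consume.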

Finally, a random forest classifier is used to generate predictions. 
A Support Vector Machine (SVM) was also explored as an alternative, to check that the two predictive models give consistent results.

### Import Packages

In [1]:
import pandas as pd
import numpy as np
import nltk 
#nltk.download('popular') # Use this to download all popular nltk data
import sklearn
from nltk.corpus import stopwords
from sklearn.ensemble import RandomForestClassifier 
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import svm

# None disables column-width truncation (-1 is deprecated in recent pandas)
pd.set_option('display.max_colwidth', None)


### Functions

#### The functions below prepare new test data and output model performance metrics such as accuracy and the confusion matrix. 

In [2]:

def food_classifier(filename, vectorizer):
    # Read data
    raw_expense = pd.read_csv(filename,encoding = "ISO-8859-1")
    
    #Replace Nan with blanks
    raw_expense = raw_expense.replace(np.nan, '', regex=True)
    
    #Join columns
    raw_expense['expense_text'] = raw_expense['type'] +' '+ raw_expense['description']
    
    #Extract columns
    cleaned_expense = raw_expense[['expense_text', 'is_food']]

    # Replace empty strings with NaN (exact match), then drop rows with missing values
    cleaned_expense = cleaned_expense.replace('', np.nan).dropna().reset_index(drop = True)
    
    # Change text to lower case
    cleaned_expense['expense_text'] = cleaned_expense['expense_text'].apply(lambda x: " ".join(x.lower() for x in x.split()))
    
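    # Replace special characters with white space
    # regex=True is required in recent pandas, where str.replace defaults to literal matching
    cleaned_expense['expense_text'] = cleaned_expense['expense_text'].str.replace(r'[^\w\s]', ' ', regex=True)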
    
    # Remove stop words
    stop_words = stopwords.words('english')
    
    cleaned_expense['expense_text']  = cleaned_expense['expense_text'] .apply(lambda x: " ".join(x for x in x.split() if x not in stop_words))

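    # Lemmatize each word (the same per-word lemmatization used when the vectorizer was fitted)
    lemmatizer = nltk.stem.WordNetLemmatizer()
    cleaned_expense['expense_text'] = cleaned_expense['expense_text'].apply(lambda x: " ".join(lemmatizer.lemmatize(word) for word in x.split()))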

    # Separate the text (X) from the labels (Y)
    # Encode labels: food is 1, non-food is 0

    X,Y = cleaned_expense.drop('is_food',axis=1), cleaned_expense['is_food']
    Y = Y.apply(lambda x: 1 if x == 'food' else 0)
    X = X.values
    X = pd.DataFrame(X)
    
    # Transform the text with the already-fitted TfidfVectorizer.
    # NOTE: transform (not fit_transform) is used so that new data is mapped
    # onto the vocabulary learned from the training set.
    # Output is a sparse document-term matrix of TF-IDF features

    X_tfidf = vectorizer.transform(X[0])
    
    return X_tfidf, Y.values
    

def rf_classifier(classifier, X_Test, Y_Test):
    
    pred = classifier.predict(X_Test)
    
    acc = accuracy_score(Y_Test, pred)
    
    conf = confusion_matrix(Y_Test, pred)
    
    return acc,conf



### Data exploration and building models

In [3]:
# Read file
# Set encoding for text: Text contains numbers, upper and lowercase English letters
raw_expense = pd.read_csv('Food_Classifier.csv',encoding = "ISO-8859-1")
raw_expense.head(5)

Unnamed: 0.1,Unnamed: 0,type,description,country,amount,amount_per_day,claim_type,is_food
0,0,Lunch when at NON client site,"Lunch at BBC Millbank, non-client site",United Kingdom,4.823,4.823,Receipt,food
1,1,Parking,,Brazil,2.6558,2.6558,Receipt,non-food
2,2,Parking,,Italy,1.1999,1.1999,Receipt,non-food
3,3,Other Expenses- Taxi,TAKSI,Turkey,15.9489,1.1392,Receipt,non-food
4,4,Onsite/offsite support - FB,,China,13.4651,13.4651,Receipt,non-food


In [4]:
# Make copy of raw data
copy_expense = raw_expense.copy()

# Get absolute value of amount data
copy_expense['amount_per_day'] = copy_expense['amount_per_day'].abs()

# Keep amounts greater than 0 (other values become NaN and are ignored by min/max)
copy_expense['amount_per_day'] = copy_expense['amount_per_day'].where(copy_expense['amount_per_day'] > 0)

# Get min and max values
copy_expense.groupby('is_food').agg({'amount_per_day': ['min', 'max']})


| is_food | amount_per_day (min) | amount_per_day (max) |
|---|---|---|
| food | 0.0109 | 1150.5644 |
| non-food | 0.0032 | 10008.1581 |


In [5]:
copy_expense['claim_type'].unique()

array(['Receipt', 'Per diem'], dtype=object)

In [6]:
# Replace Nan with blanks
raw_expense = raw_expense.replace(np.nan, '', regex=True)

# Join columns
raw_expense['expense_text'] = raw_expense['type'] +' '+ raw_expense['description']


In [7]:
# Extract columns
cleaned_expense = raw_expense[['expense_text', 'is_food']]

# Number of observations
cleaned_expense.shape[0]

56211

In [8]:
# Replace empty strings with NaN (exact match)
# Drop rows with missing values

cleaned_expense = cleaned_expense.replace('', np.nan).dropna().reset_index(drop = True)

# Number of observations
cleaned_expense.shape[0]



55141

In [9]:
# Pre-processing textual data

# Change to lower case
# Splits text into words, converts each to lower case and joins them back with single spaces

cleaned_expense['expense_text'] = cleaned_expense['expense_text'].apply(lambda x: " ".join(x.lower() for x in x.split()))



In [10]:
# Replace special characters with white space

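# regex=True is required in recent pandas, where str.replace defaults to literal matching
cleaned_expense['expense_text'] = cleaned_expense['expense_text'].str.replace(r'[^\w\s]', ' ', regex=True)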

# Check first 5 rows
cleaned_expense.head()

Unnamed: 0,expense_text,is_food
0,lunch when at non client site lunch at bbc millbank non client site,food
1,parking,non-food
2,parking,non-food
3,other expenses taxi taksi,non-food
4,onsite offsite support fb,non-food
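The lower-casing and special-character steps can be sketched on a single string with the standard `re` module. The helper name `normalize_text` is hypothetical and is not part of the notebook; unlike the cell above, it also collapses repeated spaces in the same pass.

```python
import re


def normalize_text(text):
    """Lowercase, replace non-word characters with spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # same pattern as the pandas call above
    return " ".join(text.split())         # collapse runs of whitespace


# Type and description of row 3, joined with a space as in In [6]
print(normalize_text("Other Expenses- Taxi TAKSI"))
# → other expenses taxi taksi
```

The printed string matches row 3 of the table above, which suggests the helper mirrors the notebook's behaviour for this example.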


In [11]:
# Remove stopwords
# Splits text into words, removes stop words and joins the rest back

stop_words = stopwords.words('english')

cleaned_expense['expense_text']  = cleaned_expense['expense_text'] .apply(lambda x: " ".join(x for x in x.split() if x not in stop_words))


In [12]:
# Lemmatize each word to a common root form
# (WordNetLemmatizer treats words as nouns unless a part of speech is given)
lemmatizer = nltk.stem.WordNetLemmatizer()

cleaned_expense['expense_text'] = cleaned_expense['expense_text'].apply(lambda x: " ".join(lemmatizer.lemmatize(word) for word in x.split()))



In [13]:
# Check first 5 rows
cleaned_expense.head()

Unnamed: 0,expense_text,is_food
0,lunch non client site lunch bbc millbank non client site,food
1,parking,non-food
2,parking,non-food
3,expense taxi taksi,non-food
4,onsite offsite support fb,non-food


In [14]:
# Check proportions of is_food

cleaned_expense.groupby(['is_food']).size()/len(cleaned_expense)

is_food
food        0.28059
non-food    0.71941
dtype: float64
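Because roughly 72% of the rows are non-food, a model that always predicts the majority class would already score about 72% accuracy. The sketch below illustrates that baseline on a made-up label column (hypothetical data with similar proportions, not the real file).

```python
import pandas as pd

# Hypothetical label column with roughly the proportions reported above
labels = pd.Series(["food"] * 28 + ["non-food"] * 72)

proportions = labels.value_counts(normalize=True)

# Accuracy obtained by always predicting the most frequent class
majority_class = proportions.idxmax()
baseline_accuracy = proportions.max()

print(majority_class, baseline_accuracy)
```

This is one reason precision and recall are reported alongside accuracy later in the notebook.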

In [15]:
# Split data into training and validation sets
# Data is imbalanced, so use stratified sampling

# Extract text and label data
# Assign to X and Y
X,Y = cleaned_expense.drop('is_food',axis=1), cleaned_expense['is_food']

# Classify labels as 0 and 1
# In this, food is 1 and non food is 0
Y = Y.apply(lambda x: 1 if x == 'food' else 0)

#Convert to arrays
X,Y = X.values,Y.values

# X : feature text
# Y : labels

#70-30 data split
X_train, X_val, Y_train, Y_val = train_test_split(X, Y,
                                                    test_size=0.30,
                                                    random_state = 42,
                                                    stratify=Y)
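To illustrate what `stratify` does, the following sketch splits a small made-up label array (hypothetical data, 30% positive) and checks that both parts keep the same class proportion.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 30% positive, 70% negative
y = np.array([1] * 30 + [0] * 70)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

# Share of positives in each part; with stratification both stay at 30%
print(len(y_tr), y_tr.mean())
print(len(y_te), y_te.mean())
```

Without `stratify`, the share of positives in each part would vary with the random seed, which matters more as the classes become more imbalanced.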


In [16]:
# Number of observations
X_train.shape[0]

38598

In [17]:
# Number of observations
X_val.shape[0]

16543

In [18]:
# Convert train and test text data to pandas data frame
X_train = pd.DataFrame(X_train)

X_val = pd.DataFrame(X_val)

# Use tfidf vectorizer for feature extraction
# NOTE: TfidfVectorizer is used to 
# convert a collection of raw documents to a matrix of TF-IDF features.
# Output is a sparse document-term matrix (rows are documents, columns are terms)

# Call tfidf vectorizer function
vectorizer = TfidfVectorizer()

# Fit the tf-idf vectorizer on the training data and transform it
X_train_tfidf = vectorizer.fit_transform(X_train[0])

# Check matrix dimensions
X_train_tfidf.shape

(38598, 10164)

In [19]:
#Use fitted vectorizer on the testing data
X_val_tfidf = vectorizer.transform(X_val[0])

# Check matrix dimensions
X_val_tfidf.shape

(16543, 10164)
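The validation matrix has the same 10164 columns as the training matrix because `transform` reuses the vocabulary learned by `fit_transform`. A small sketch with made-up text (hypothetical, not from the data set) shows that words unseen during fitting are simply ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["lunch client site", "parking airport"]  # hypothetical training text
new_docs = ["taxi", "lunch taxi"]                       # "taxi" was never seen in training

vec = TfidfVectorizer()
vec.fit(train_docs)                  # vocabulary is fixed here
new_tfidf = vec.transform(new_docs)  # new text is mapped onto that vocabulary

print(new_tfidf.shape)   # same number of columns as the training vocabulary
print(new_tfidf[0].nnz)  # number of non-zero entries for "taxi"
print(new_tfidf[1].nnz)  # number of non-zero entries for "lunch taxi"
```

Fitting the vectorizer on the training split only, as done above, also keeps information from the validation set out of the learned vocabulary and idf weights.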

In [20]:
# Random forest classifier model
# n_estimators is the number of trees in the forest
# min_samples_split is the minimum number of samples a node must contain before it can be split
classifier = RandomForestClassifier(n_estimators = 500, min_samples_split=100, criterion = "gini", random_state = 42)

# Fit classifier to training set
classifier = classifier.fit(X_train_tfidf, Y_train)

# Predict on test data using the fitted classifier
pred = classifier.predict(X_val_tfidf)

# Check accuracy between the actual and predicted labels
acc = accuracy_score(Y_val, pred)

# Create a confusion matrix
conf = confusion_matrix(Y_val, pred)

tn, fp, fn, tp = conf.ravel()

# Estimate precision and recall
precision = tp/(tp+fp)
recall = tp/(tp+fn)

print('Accuracy: {}%'.format(round(acc*100,4)))
print('True Negatives: {}'.format(tn))
print('False Positives: {}'.format(fp))
print('False Negatives: {}'.format(fn))
print('True Positives: {}'.format(tp))
print('Precision: {}%'.format(round(precision*100,4)))
print('Recall: {}%'.format(round(recall*100,4)))

Accuracy: 99.7763%
True Negatives: 11882
False Positives: 19
False Negatives: 18
True Positives: 4624
Precision: 99.5908%
Recall: 99.6122%
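The reported figures can be re-derived from the four confusion-matrix counts alone. The helper `summarize_counts` below is a hypothetical convenience function, not part of the notebook; it is fed the random forest counts printed above.

```python
def summarize_counts(tn, fp, fn, tp):
    """Return accuracy, precision and recall from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),  # share of predicted food that is food
        "recall": tp / (tp + fn),     # share of actual food that was found
    }


# Counts reported for the random forest on the validation set
rf_metrics = summarize_counts(tn=11882, fp=19, fn=18, tp=4624)
for name, value in rf_metrics.items():
    print("{}: {}%".format(name, round(value * 100, 4)))
```

The four counts sum to 16543, the size of the validation set, and the derived percentages agree with the printed output.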


In [21]:
# Support Vector Machine classifier
classifier = svm.SVC(gamma=0.01, C=100., random_state = 42)

# Fit classifier to training set
classifier = classifier.fit(X_train_tfidf, Y_train) 

# Predict on test data using the fitted classifier
pred = classifier.predict(X_val_tfidf)

# Check accuracy between the actual and predicted labels
acc = accuracy_score(Y_val, pred)

# Create a confusion matrix
conf = confusion_matrix(Y_val, pred)

tn, fp, fn, tp = conf.ravel()

# Estimate precision and recall
precision = tp/(tp+fp)

recall = tp/(tp+fn)

print('Accuracy: {}%'.format(round(acc*100,4)))
print('True Negatives: {}'.format(tn))
print('False Positives: {}'.format(fp))
print('False Negatives: {}'.format(fn))
print('True Positives: {}'.format(tp))
print('Precision: {}%'.format(round(precision*100,4)))
print('Recall: {}%'.format(round(recall*100,4)))

Accuracy: 99.6373%
True Negatives: 11872
False Positives: 29
False Negatives: 31
True Positives: 4611
Precision: 99.375%
Recall: 99.3322%


### For Testing New Data

#### The functions created earlier in the file are called here to print model performance metrics for new test data.
#### The cells defining those functions must be run first to avoid errors below.
##### ***Note: The cells below are left commented out until new data is supplied.***

In [22]:
# #Input new test data
# filename='Food_Classifier.csv'

# #Get the new tfidf and is_food labels from new data
# X_tfidf_new, Y_new = food_classifier(filename, vectorizer)

In [23]:
# # Call the new tfidf matrix and is food labels here
# X_Test,Y_Test = X_tfidf_new, Y_new

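# # Evaluate the fitted model on the new data
# # Note: `classifier` is the most recently fitted model (the SVM from In [21]);
# # re-run In [20] first to evaluate the random forest instead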
# accuracy, conf_matrix = rf_classifier(classifier, X_Test, Y_Test)
# tn, fp, fn, tp = conf_matrix.ravel()
# precision = tp/(tp+fp)
# recall = tp/(tp+fn)


In [24]:
# print('Accuracy: {}%'.format(round(accuracy*100,4)))
# print('True Negatives: {}'.format(tn))
# print('False Positives: {}'.format(fp))
# print('False Negatives: {}'.format(fn))
# print('True Positives: {}'.format(tp))
# print('Precision: {}%'.format(round(precision*100,4)))
# print('Recall: {}%'.format(round(recall*100,4)))
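As an alternative to passing the fitted vectorizer and classifier around separately, the two can be bundled with scikit-learn's `make_pipeline`, so that new text always goes through the same transform before prediction. This is only a sketch on made-up data (the texts, labels and the smaller `n_estimators` value are assumptions), not the approach used in this notebook.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical, already-cleaned training text and labels (1 = food, 0 = non-food)
train_text = ["lunch client site", "dinner team", "parking airport", "taxi airport"]
train_labels = [1, 1, 0, 0]

# The pipeline fits the vectorizer and the classifier together
model = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=50, random_state=42),
)
model.fit(train_text, train_labels)

# New text is vectorized with the training vocabulary, then classified
new_pred = model.predict(["lunch airport", "parking"])
print(new_pred)
```

The text-cleaning steps (lower-casing, stop-word removal, lemmatization) would still need to be applied to new data before calling `predict`.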

### References
- http://scikit-learn.org/
- https://towardsdatascience.com/
- https://www.garysieling.com/blog/sklearn-gini-vs-entropy-criteria
- http://scikit-learn.org/stable/supervised_learning.html
- https://stackoverflow.com/questions/18171739/unicodedecodeerror-when-reading-csv-file-in-pandas-with-python
- http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex8/ex8.html
