# Costa Rican Household Poverty Level Prediction

Author: Danilo Polidoro

PMR3508

## Getting to know the data

First, let's import the data and get to know it:

In [None]:
import warnings

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import sklearn
from sklearn import preprocessing
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

%config InlineBackend.figure_format = 'svg'
mpl.rcParams['figure.dpi'] = 300
pd.set_option('display.expand_frame_repr', False)

In [None]:
costaRica = pd.read_csv('../input/train.csv').dropna()

In [None]:
costaRica.iloc[:,0:14].head()

In [None]:
costaRica.iloc[:,14:28].head()

In [None]:
costaRica.iloc[:,28:42].head()

In [None]:
costaRica.iloc[:,42:56].head()

In [None]:
costaRica.iloc[:,56:70].head()

In [None]:
costaRica.iloc[:,70:84].head()

In [None]:
costaRica.iloc[:,84:98].head()

In [None]:
costaRica.iloc[:,98:112].head()

In [None]:
costaRica.iloc[:,112:126].head()

In [None]:
costaRica.iloc[:,126:144].head()

There are five non-numeric columns: **Id**, **idhogar**, **dependency**, **edjefa** and **edjefe**. For now, we'll ignore them.
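As a minimal sketch of how such columns can be spotted programmatically, the snippet below uses `select_dtypes` on a toy frame. The frame `toy` and its values are hypothetical stand-ins; on the real data the same call would be applied to `costaRica`.

```python
import pandas as pd

# Toy frame with hypothetical values; the real columns come from train.csv.
toy = pd.DataFrame({
    'Id': ['ID_a', 'ID_b'],
    'rooms': [3, 4],
    'dependency': ['no', '8'],
    'Target': [4, 2],
})

# Keep only the columns whose dtype is not numeric.
non_numeric = toy.select_dtypes(exclude='number').columns.tolist()
print(non_numeric)
```

Listing the columns this way avoids hard-coding names that might change if the input file changes.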

We'll create a k-NN classifier, varying K from 1 to 100, and pick the best K by 10-fold cross-validation (cv = 10):

In [None]:
newCosta = costaRica.drop(['Id','idhogar', 'dependency', 'edjefa', 'edjefe'], axis = 1)

In [None]:
X = newCosta.iloc[:, 0:137]  # feature columns
y = newCosta.iloc[:, 137]    # Target as a 1-D Series, as scikit-learn expects

accuracies = []
for k in range(1, 101):
    classifier = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(classifier, X, y, cv=10)
    accuracies.append(scores.mean())
    print('K = {0}; accuracy = {1}'.format(k, scores.mean()))

# accuracies[0] corresponds to K = 1, hence the +1
best_k = accuracies.index(max(accuracies)) + 1
print('')
print('Best classifier at K = {0} with accuracy = {1}'.format(best_k, max(accuracies)))

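With that, we get a reasonably good classifier at K = 9, with a mean cross-validated accuracy of 0.8821965452847806 (about 0.882). Note that this score is measured only on the training rows that survived `dropna()`.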
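The search over K can also be written compactly and tried on synthetic data. This is only a sketch: `make_classification` stands in for the real features and Target, and the smaller range of K and `cv=5` are assumptions chosen to keep it quick, not the settings used above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the feature matrix and the 4-class Target column.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=4, random_state=0)

k_values = list(range(1, 16))
mean_scores = []
for k in k_values:
    # Mean accuracy over the 5 cross-validation folds for this K.
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    mean_scores.append(scores.mean())

# K with the highest mean cross-validated accuracy.
best_k = k_values[int(np.argmax(mean_scores))]
print(best_k, max(mean_scores))
```

Keeping `k_values` and `mean_scores` side by side makes it easy to plot accuracy against K afterwards.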

Can we improve this result?

## Predicting test data

First, let's load the test data:

In [None]:
costaRicaTest = pd.read_csv('../input/test.csv')

Every test row needs a prediction, so we cannot drop entries with missing values as we did for the training set. Instead, let's fill all the missing data with 0:

In [None]:
costaRicaTestFill = costaRicaTest.fillna(0)
newCostaRicaTest = costaRicaTestFill.drop(['Id','idhogar', 'dependency', 'edjefa', 'edjefe'], axis = 1)
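Before filling, it can help to see how much data is actually missing. The following is a small sketch on a toy frame; `toy_test` and its values are hypothetical, and on the real data the same calls would be applied to `costaRicaTest`.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values standing in for test.csv.
toy_test = pd.DataFrame({
    'v2a1': [np.nan, 120000.0, np.nan],
    'rooms': [3, 4, 5],
})

# Number of missing values in each column.
missing_per_column = toy_test.isna().sum()
print(missing_per_column)

# Replace every missing value with 0, keeping all rows.
filled = toy_test.fillna(0)
print(filled)
```

Filling with 0 keeps every row, at the cost of treating "missing" as a real value of 0, which can distort the distances that k-NN relies on.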

Let's refit the best classifier we found (K = 9) on the training data:

In [None]:
classifier = KNeighborsClassifier(n_neighbors=9)
# Pass the target as a 1-D Series (column 137) rather than a one-column DataFrame
classifier.fit(newCosta.iloc[:, 0:137], newCosta.iloc[:, 137])

Now, we'll predict the results:

In [None]:
prediction = classifier.predict(newCostaRicaTest)

In [None]:
import csv

In [None]:
ids = costaRicaTest['Id'].values

In [None]:
# Write the header and one row per test Id; the with-block closes the file
with open('submission.csv', mode='w', newline='') as csvFile:
    csvWriter = csv.writer(csvFile)
    csvWriter.writerow(['Id', 'Target'])
    for index, element in enumerate(ids):
        csvWriter.writerow([element, prediction[index]])
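As an alternative sketch, the same two-column file can be produced with pandas instead of the `csv` module. The `ids` and `prediction` values below are hypothetical placeholders for the arrays computed above.

```python
import pandas as pd

# Hypothetical placeholders for the real test Ids and predicted classes.
ids = ['ID_a', 'ID_b', 'ID_c']
prediction = [4, 2, 4]

submission = pd.DataFrame({'Id': ids, 'Target': prediction})

# With no path, to_csv returns the CSV text; index=False omits the row index.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Passing a file name, for example `submission.to_csv('submission.csv', index=False)`, would write the file directly.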