<h1> Problem</h1>

The data for this competition is provided in two files: train.csv and test.csv. The training set has 9557 rows and 143 columns while the testing set has 23856 rows and 142 columns. Each row represents one individual and each column is a feature, either unique to the individual, or for the household of the individual. The training set has one additional column, Target, which represents the poverty level on a 1-4 scale and is the label for the competition. A value of 1 is the most extreme poverty.

<font size="4">This is a supervised multi-class classification machine learning problem.</font>

<h1> Objective</h1>

<font size="4">The objective is to predict poverty at the household level, i.e., the Target variable.</font>

<font size="4">The core Data Fields are as follows:
* Id - a unique identifier for each row.<br>
* Target - the target is an ordinal variable indicating groups of income levels.<br>
    * 1 = extreme poverty <br>
    * 2 = moderate poverty <br>
    * 3 = vulnerable households <br> 
    * 4 = non vulnerable households <br>
* idhogar - this is a unique identifier for each household. This can be used to create household-wide features, etc. All rows in a given household will have a matching value for this identifier.<br>
* parentesco1 - indicates if this person is the head of the household.<br>
* This data contains 142 total columns.<br>
    </font>

<font size="4">As usual, we train the model on the train dataset and then generate predictions for the test dataset. The kernel is divided into three major parts.</font>

# Part I : Exploratory Data Analysis

# Importing Libraries

<font size="4">This is where the actual fun begins. We start off by importing all the libraries that we will need later on. We will be using NumPy and pandas for data analysis, and matplotlib (a plotting library with a MATLAB-style interface) and seaborn for data visualisation.</font>

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


# Input data files are available in the "../input/" directory.

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [None]:
#Data Visualization
import seaborn as sns 
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
#Loading Data
train= pd.read_csv("../input/train.csv")
test= pd.read_csv("../input/test.csv")
#Displaying the first five rows of the dataset so as to get a feel of the data.
train.head()

That gives us a look at all of the columns, which don't appear to be in any particular order. To get a quick overview of the data we use .info().

In [None]:
train.info()

This gives us the number of rows and columns as well as the data types present.

Next, we count the rows in each poverty level of the Target variable.

In [None]:
train['Target'].value_counts()

The next cell gives us the statistical summary of the train dataset.

In [None]:
train.describe()

Now we perform the same for the test as well.

In [None]:
test.info()

In [None]:
test.describe()

As you can tell, the test dataset has one column fewer than the training dataset. This is because it lacks the 'Target' column, which is what we are going to predict.

In [None]:
#A plot to visualise the Target distribution (x is passed by keyword, as newer seaborn versions require).
sns.countplot(x='Target', data=train)

From the above plot we can conclude that the classes are imbalanced.
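To put a number on the imbalance, the share of each class can be computed with `value_counts(normalize=True)`. The sketch below uses a small made-up Series (`toy_target` is a hypothetical stand-in for `train['Target']`, not the real data).

```python
import pandas as pd

# Toy labels standing in for train['Target'] (illustrative values only)
toy_target = pd.Series([4, 4, 4, 4, 4, 4, 2, 2, 3, 1])

# Fraction of rows in each class, ordered by class label
class_share = toy_target.value_counts(normalize=True).sort_index()
print(class_share)
```

On the real data the same call would be `train['Target'].value_counts(normalize=True)`.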

We examined how education affects the poverty level of the household. The feature “meaneduc” is the average years of education of the adults in the household. When we plot this feature against the Target variable, we can see that the households least at risk of poverty tend to have higher education levels.

In [None]:
from collections import OrderedDict
poverty_mapping = OrderedDict({1: 'extreme', 2: 'moderate', 3: 'vulnerable', 4: 'non vulnerable'})
plt.figure(figsize = (10, 6))
sns.boxplot(x = 'Target', y = 'meaneduc', data = train);
plt.xticks([0, 1, 2, 3], poverty_mapping.values())
plt.title('Average Schooling by Target')

We also examined how household crowding relates to the poverty level. The feature “overcrowding” measures the number of persons per room. Plotting this feature against the Target variable suggests that the more crowded a household is, the more susceptible it is to poverty.

In [None]:
plt.figure(figsize = (10, 6))
sns.boxplot(x = 'Target', y = 'overcrowding', data = train);
plt.xticks([0, 1, 2, 3], poverty_mapping.values())
plt.title('Overcrowding by Target');

Now we combine the train and test datasets so that each cleaning step is applied once instead of being repeated on both. We will separate them again after we clean the data.

In [None]:
#We are doing this because the test doesn't have the Target column.
train2=train.drop('Target',axis=1)

In [None]:
# Appending the data (DataFrame.append was removed in pandas 2.0, so we use pd.concat)
data = pd.concat([train2, test], sort=True)

Let's have a look at the dependency column.

In [None]:
data['dependency'].value_counts()

From the above output we can see that the dependency column contains "yes" and "no" values mixed in with numbers. To make it numeric we map "yes" to 1 and "no" to 0, and we do the same for edjefa and edjefe.

In [None]:
mapping = {"yes": 1, "no": 0}

# Fill in the values with the correct mapping
data['dependency'] = data['dependency'].replace(mapping).astype(np.float64)
data['edjefa'] = data['edjefa'].replace(mapping).astype(np.float64)
data['edjefe'] = data['edjefe'].replace(mapping).astype(np.float64)

data[['dependency', 'edjefa', 'edjefe']].describe()
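To make the mapping concrete, here is a small sketch on a made-up column. The name `toy_dependency` and its values are assumptions for illustration; the real column mixes "yes"/"no" with numeric strings in a similar way.

```python
import numpy as np
import pandas as pd

mapping = {"yes": 1, "no": 0}

# Toy column mixing yes/no with numbers stored as strings (illustrative values only)
toy_dependency = pd.Series(["yes", "no", "0.5", "8", "2"], dtype=object)

# Replace the two words, then cast everything to float
toy_numeric = toy_dependency.replace(mapping).astype(np.float64)
print(toy_numeric.tolist())
```

After the cast the column is purely numeric, so it can be summarised with `.describe()` and passed to the models.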

# Outliers

Outliers are values that lie really far from the rest of the distribution of the data. We have to deal with them because they can distort our model. There is only one outlier value in this data, in the rez_esc column, and according to the answer from the competition host (https://www.kaggle.com/c/costa-rican-household-poverty-prediction/discussion/61403), we can safely change the value to 5.

In [None]:
#outlier in test set which rez_esc is 99.0
data.loc[data['rez_esc'] == 99.0 , 'rez_esc'] = 5

# Missing Values

One of the most basic yet important steps in EDA is to find the missing values in the data.

In [None]:
# Number of missing in each column
missing = pd.DataFrame(data.isnull().sum()).rename(columns = {0: 'total'})

# Create a percentage missing
missing['percent'] = missing['total'] / len(data)

missing.sort_values('percent', ascending = False).head(10)
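The same table can be built on a tiny frame to see what each column means. This is only a sketch: `toy` and its columns are made up. Note that `percent` is stored as a fraction between 0 and 1, not as a value out of 100.

```python
import numpy as np
import pandas as pd

# Toy frame with a different number of gaps in each column (illustrative values only)
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, 3.0, 4.0],
    "c": [np.nan, np.nan, np.nan, 4.0],
})

# Count the missing values per column, then express them as a fraction of the rows
toy_missing = toy.isnull().sum().to_frame("total")
toy_missing["percent"] = toy_missing["total"] / len(toy)
toy_missing = toy_missing.sort_values("percent", ascending=False)
print(toy_missing)
```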

The table above lists the ten columns with the most missing values. Now we need to fill these with appropriate values, each derived from a concrete hypothesis.

In [None]:
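# v18q1 (number of tablets owned): treat a missing value as zero tablets
data['v18q1'] = data['v18q1'].fillna(0)

# Households with tipovivi1 == 1 (home owned and fully paid) pay no rent, so v2a1 is set to 0
data.loc[(data['tipovivi1'] == 1), 'v2a1'] = 0
data['v2a1-missing'] = data['v2a1'].isnull()

# rez_esc (years behind in school): fill with 0 for people older than 19 or younger than 7
data.loc[((data['age'] > 19) | (data['age'] < 7)) & (data['rez_esc'].isnull()), 'rez_esc'] = 0
data['rez_esc-missing'] = data['rez_esc'].isnull()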

In [None]:
#electricity columns
elec = []

for i, row in data.iterrows():
    if row['noelec'] == 1:
        elec.append(0)
    elif row['coopele'] == 1:
        elec.append(1)
    elif row['public'] == 1:
        elec.append(2)
    elif row['planpri'] == 1:
        elec.append(3)
    else:
        elec.append(np.nan)
        
data['elec'] = elec
data['elec-missing'] = data['elec'].isnull()
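The row-by-row loop above works, but `iterrows` can be slow on a frame of this size. As a possible alternative, the same if/elif chain can be written with `np.select`, which checks the conditions in order and falls back to a default. The sketch below runs on a made-up frame (`toy` is hypothetical); only the four column names come from the data.

```python
import numpy as np
import pandas as pd

# Toy one-hot electricity flags; the last row has no flag set (illustrative values only)
toy = pd.DataFrame({
    "noelec":  [1, 0, 0, 0, 0],
    "coopele": [0, 1, 0, 0, 0],
    "public":  [0, 0, 1, 0, 0],
    "planpri": [0, 0, 0, 1, 0],
})

# Conditions are evaluated in order, mirroring the if/elif chain
conditions = [
    toy["noelec"] == 1,
    toy["coopele"] == 1,
    toy["public"] == 1,
    toy["planpri"] == 1,
]
toy["elec"] = np.select(conditions, [0, 1, 2, 3], default=np.nan)
toy["elec-missing"] = toy["elec"].isnull()
print(toy[["elec", "elec-missing"]])
```

Rows where none of the four flags is set end up as NaN, just as in the loop, and are flagged in `elec-missing`.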

In [None]:
#remove already present electricity columns
data = data.drop(columns = ['noelec', 'coopele', 'public', 'planpri'])


In [None]:
#walls ordinal
data['walls'] = np.argmax(np.array(data[['epared1', 'epared2', 'epared3']]),
                           axis = 1)
data = data.drop(columns = ['epared1', 'epared2', 'epared3'])
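The `np.argmax` trick used here (and in the following cells) turns a group of one-hot columns into a single ordinal column: for each row it returns the position of the column holding the 1. The sketch below shows this on a made-up frame, assuming the columns are listed from worst to best condition. One caveat worth knowing: a row with no flag set also gets 0, because `argmax` returns the first position when all values are equal.

```python
import numpy as np
import pandas as pd

# Toy one-hot wall-quality flags; the last row has no flag set (illustrative values only)
toy = pd.DataFrame({
    "epared1": [1, 0, 0, 0],
    "epared2": [0, 1, 0, 0],
    "epared3": [0, 0, 1, 0],
})

# Position of the 1 in each row becomes the ordinal value
toy["walls"] = np.argmax(toy[["epared1", "epared2", "epared3"]].to_numpy(), axis=1)
print(toy["walls"].tolist())  # [0, 1, 2, 0]
```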

In [None]:
#roof ordinal
data['roof'] = np.argmax(np.array(data[['etecho1', 'etecho2', 'etecho3']]),
                           axis = 1)
data = data.drop(columns = ['etecho1', 'etecho2', 'etecho3'])

In [None]:
#floor ordinal
data['floor'] = np.argmax(np.array(data[['eviv1', 'eviv2', 'eviv3']]),
                           axis = 1)
data = data.drop(columns = ['eviv1', 'eviv2', 'eviv3'])

In [None]:
#Flushing system
data['flush'] = np.argmax(np.array(data[["sanitario1",'sanitario5', 'sanitario2', 'sanitario3',"sanitario6"]]),
                           axis = 1)
data = data.drop(columns = ["sanitario1",'sanitario5', 'sanitario2', 'sanitario3',"sanitario6"])

In [None]:
#Drop columns with squared variables
data = data[[x for x in data if not x.startswith('SQB')]]
data = data.drop(columns = ['agesq'])

In [None]:
#waterprovision
data['waterprovision'] = np.argmax(np.array(data[['abastaguano', 'abastaguafuera', 'abastaguadentro']]),
                           axis = 1)
data = data.drop(columns = ['abastaguano', 'abastaguafuera', 'abastaguadentro'])

In [None]:
#Education Level
data['inst'] = np.argmax(np.array(data[[c for c in data if c.startswith('instl')]]), axis = 1)
data = data.drop(columns = [c for c in data if c.startswith('instlevel')])


In [None]:
#cooking energy source (stored in its own column so it does not overwrite 'waterprovision')
data['cooking'] = np.argmax(np.array(data[['energcocinar1','energcocinar4', 'energcocinar3', 'energcocinar2']]),
                           axis = 1)
data = data.drop(columns = ['energcocinar1','energcocinar4', 'energcocinar3', 'energcocinar2'])

In [None]:
#meaneduc is defined as average years of education for adults (18+)
data.loc[pd.isnull(data['meaneduc']), 'meaneduc'] = data.loc[pd.isnull(data['meaneduc']), 'escolari']

Now that the data is cleaned, we split it back into the train and test sets.

In [None]:
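#The first len(train) rows (9557) are the training data; the remaining rows are the test data
train2=data.iloc[:len(train),:]
test2=data.iloc[len(train):,:]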

In [None]:
#Reassign instead of dropping in place, which avoids modifying a slice of data
test2=test2.drop(['Id','idhogar'],axis=1)

Assigning X, which holds the features, and y, which is our Target.

In [None]:
X=train2.drop(['Id','idhogar'],axis=1)

In [None]:
y=train['Target']

# Modeling with XGBoost and LightGBM

Here we try two popular gradient boosting classifiers, XGBoost and LightGBM. For the final submission we will use LightGBM as it produced the better score.

# XGBoost

In [None]:
import xgboost as xgb # Importing XGboost Library

In [None]:
xg=xgb.XGBClassifier(n_estimators=200)

In [None]:
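#Recent XGBoost releases expect class labels starting at 0, so shift Target from 1-4 to 0-3
xg.fit(X,y-1)

In [None]:
#Shift the predictions back to the original 1-4 scale
preds = xg.predict(test2) + 1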

Custom Evaluation Metric

In [None]:
from sklearn.metrics import f1_score

def macro_f1_score(labels, predictions):
    # Reshape the flat predictions to (n_classes, n_samples) and pick the most likely class
    predictions = predictions.reshape(len(np.unique(labels)), -1).argmax(axis = 0)

    metric_value = f1_score(labels, predictions, average = 'macro')

    # Return is name, value, is_higher_better
    return 'macro_f1', metric_value, True
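Macro F1 is the unweighted mean of the per-class F1 scores, so a rare class counts as much as a common one. To show what `average = 'macro'` computes, here is a hand-rolled sketch in plain Python. The helper `macro_f1_by_hand` and the toy labels are made up for illustration; they are not part of the pipeline above.

```python
def macro_f1_by_hand(labels, predictions):
    """Unweighted mean of per-class F1 scores (illustrative helper)."""
    classes = sorted(set(labels) | set(predictions))
    pairs = list(zip(labels, predictions))
    f1_values = []
    for c in classes:
        tp = sum(1 for l, p in pairs if l == c and p == c)  # true positives
        fp = sum(1 for l, p in pairs if l != c and p == c)  # false positives
        fn = sum(1 for l, p in pairs if l == c and p != c)  # false negatives
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears
        f1_values.append(2 * tp / denom if denom else 0.0)
    return sum(f1_values) / len(f1_values)

# Toy example: the single class-3 row is misclassified as class 4
toy_labels      = [1, 2, 3, 4, 4, 4]
toy_predictions = [1, 2, 4, 4, 4, 4]
score = macro_f1_by_hand(toy_labels, toy_predictions)
print(round(score, 4))
```

In this toy case five of six predictions are correct, yet the macro F1 is noticeably lower than the accuracy because class 3 scores zero. That sensitivity to small classes is why the metric suits imbalanced targets.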

# LightGBM

In [None]:
# Libraries for LightGBM
import lightgbm as lgb
import sklearn.model_selection as model_selection
from sklearn.metrics import f1_score, make_scorer

In [None]:
lgmodel = lgb.LGBMClassifier(metric = "",num_class = 4)

In [None]:
hyp_OPTaaS = {'boosting_type': 'dart',
              'colsample_bytree': 0.9843467236959204,
              'learning_rate': 0.11598629586769524,
              'min_child_samples': 44,
              'num_leaves': 49,
              'reg_alpha': 0.35397370408131534,
              'reg_lambda': 0.5904910774606467,
              'subsample': 0.6299872254632797,
              'subsample_for_bin': 60611}


In [None]:
model = lgb.LGBMClassifier(**hyp_OPTaaS, class_weight = 'balanced',max_depth=-1,objective = 'multiclass', n_jobs = -1, n_estimators = 100)
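The `class_weight = 'balanced'` setting is there because the Target classes are imbalanced. Following the scikit-learn convention, each class is weighted by `n_samples / (n_classes * count)`, so rarer classes get larger weights. The sketch below reproduces that formula with NumPy on made-up labels (`toy_y` is hypothetical).

```python
import numpy as np

# Toy labels with an imbalance towards class 4 (illustrative values only)
toy_y = np.array([4, 4, 4, 4, 4, 4, 2, 2, 3, 1])

# Balanced weight per class: n_samples / (n_classes * class_count)
classes, counts = np.unique(toy_y, return_counts=True)
balanced_weight = len(toy_y) / (len(classes) * counts)

for c, w in zip(classes, balanced_weight):
    print(c, round(float(w), 3))
```

Here the two single-row classes receive the largest weight and the majority class the smallest, which pushes the model to pay more attention to the poorer, rarer groups.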

In [None]:
model.fit(X, y)

# Submission

In [None]:
pred=model.predict(test2)

In [None]:
my_submission = pd.DataFrame({'Id': test.Id, 'Target': pred})
# you could use any filename. We choose submission here
my_submission.to_csv('submission.csv', index=False)