**Module name**: readTrainingData 

**Parameters**: None

**Description**: 
Reads train.csv and separates the categorical columns from the continuous ones.
It also splits off the target *loss* values, which are passed to the predictive model separately during training.

**Return values**:

 - **data**: A dictionary mapping each ***id*** to the rest of its row, with ***loss*** removed (not used later; kept in case it is needed)
 - **categories**: The categorical values of each row
 - **continuous**: The continuous values of each row, converted to floats
 - **target**: The ***loss*** value of each row, converted to a float
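
The column layout this module relies on (one id, then 116 categorical columns, then 14 continuous columns, then loss) can be illustrated on a miniature row. The helper `split_row` and its `n_cat` and `n_cont` parameters are hypothetical names used only for this sketch; the notebook's own function hard-codes the slice positions instead.

```python
def split_row(row, n_cat, n_cont):
    """Split one CSV row laid out as [id, categorical..., continuous..., loss].

    Hypothetical helper for illustration; not part of the notebook.
    """
    row_id = int(row[0])
    cats = row[1:1 + n_cat]
    conts = [float(x) for x in row[1 + n_cat:1 + n_cat + n_cont]]
    loss = float(row[1 + n_cat + n_cont])
    return row_id, cats, conts, loss

# Miniature row: 2 categorical and 2 continuous columns instead of 116 and 14.
toy_row = ["7", "A", "B", "0.25", "0.5", "1200.0"]
row_id, cats, conts, loss = split_row(toy_row, n_cat=2, n_cont=2)
print(row_id, cats, conts, loss)
```

Values read by the csv module arrive as strings, which is why the id, the continuous values and the loss are converted explicitly.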

In [None]:
import csv
def readTrainingData():
    data = {}
    categories = []
    continuous = []
    target = []
    with open('../input/train.csv') as csvfile:
        trainReader = csv.reader(csvfile, delimiter=',')
        next(trainReader, None)  # skip the header row
        for row in trainReader:
            # Row layout: id, 116 categorical columns, 14 continuous columns, loss
            key = int(row.pop(0))
            categories.append(row[0:116])
            continuous.append([float(x) for x in row[116:130]])
            target.append(float(row[130]))
            row.pop()  # drop loss (the last column) so data holds features only
            data[key] = row
    return data, categories, continuous, target

The cell below calls ***readTrainingData***.

In [None]:
data, categories,continuous,target = readTrainingData()
print ("readTrainingData done")

**Module name**: categoryEncoder

**Parameters**: 

 - **categories**: Categorical values extracted by **readTrainingData**
 - **continuous**: Continuous values extracted by **readTrainingData**

**Description**: 
This is the most important module, since it converts the categorical training data into numeric labels. This is how it's done:

 - Every unique category value in the training data is assigned a unique ***label***. Labels are shared across columns, so the same value gets the same label wherever it appears.

 - Each categorical value is replaced by its ***label*** in a 2D list called encodedCategories.

 - A dictionary of the unique categories and their labels is kept so that the test data categories can be encoded the same way.

The continuous values of each row are then appended to its encoded categorical values, so that each row again resembles a row of the training data.

**Return values**:

 - **encodedCategories**: The encoded categorical values of each row, followed by that row's continuous values
 - **uniqCatAndCounts**: A dictionary mapping each category to its label, needed when encoding the test data categories
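
The encoding steps can be sketched on toy data. This is a simplified illustration rather than the notebook's function: `build_labels` and the toy lists are hypothetical names, and labels are handed out in order of first appearance, which matches dictionary iteration order on Python 3.7 and later.

```python
def build_labels(rows):
    """Assign labels 1, 2, 3, ... to category values in order of first appearance."""
    labels = {}
    for row in rows:
        for value in row:
            if value not in labels:
                labels[value] = len(labels) + 1
    return labels

toy_categories = [["A", "B"], ["B", "C"], ["A", "A"]]
toy_continuous = [[0.1], [0.2], [0.3]]

labels = build_labels(toy_categories)

# Replace each category by its label, then append the row's continuous values.
encoded = [
    [float(labels[v]) for v in cats] + conts
    for cats, conts in zip(toy_categories, toy_continuous)
]
print(labels)   # {'A': 1, 'B': 2, 'C': 3}
print(encoded)  # [[1.0, 2.0, 0.1], [2.0, 3.0, 0.2], [1.0, 1.0, 0.3]]
```

Note that "A" maps to 1.0 in both columns: the label depends only on the value, not on the column it came from.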
 

In [None]:
def categoryEncoder(categories,continuous): 
    uniqCatAndCounts = {}
    numRows = len(categories)
    numCols = len(categories[0])
    encodedCategories = [[0 for x in range(numCols)] for y in range(numRows)] 
    for i in range(numRows):
        for j in range(numCols):
            if categories[i][j] in uniqCatAndCounts:
                uniqCatAndCounts[categories[i][j]] += 1
            else:
                uniqCatAndCounts[categories[i][j]] = 1
    label = 1
    for cat in uniqCatAndCounts:
        uniqCatAndCounts[cat] = label
        label += 1
    for i in range(numRows):
        for j in range(numCols):
            if categories[i][j] in uniqCatAndCounts:
                encodedCategories[i][j] = float(uniqCatAndCounts[categories[i][j]])
        encodedCategories[i].extend(continuous[i])
    return encodedCategories,uniqCatAndCounts

encodedCategories,uniqCatAndCounts = categoryEncoder(categories,continuous)
print ("categoryEncoder done")

**Module name**: readTestData

**Parameters**: None

**Description**: Reads test.csv and separates the categorical columns from the continuous ones.
It also collects the test ids, which are needed when writing out the final predictions.


**Return values**:

 - **data**: A dictionary mapping each ***id*** to the rest of its row (not used later; kept in case it is needed)
 - **categories**: The categorical values of each row
 - **continuous**: The continuous values of each row, converted to floats
 - **testIds**: The ***id*** of each row in the test data
 

In [None]:
import csv
def readTestData():
    data = {}
    testIds = []
    categories = []
    continuous = []
    with open('../input/test.csv') as csvfile:
        testReader = csv.reader(csvfile, delimiter=',')
        next(testReader, None)  # skip the header row
        for row in testReader:
            # Row layout: id, 116 categorical columns, 14 continuous columns
            key = int(row.pop(0))
            testIds.append(key)
            categories.append(row[0:116])
            continuous.append([float(x) for x in row[116:130]])
            data[key] = row
    return data, categories, continuous, testIds

dataTest, categoriesTest, continuousTest, testIds = readTestData()
print("readTestData done")

**Module name**: testCatEncoder

**Parameters**: 

 - **categoriesTest**: Categorical values extracted by **readTestData**
 - **uniqCatAndCounts**: The categories and their labels built by **categoryEncoder**
 - **continuousTest**: Continuous values extracted by **readTestData**

**Description**: 
This is another important module, since it converts the categorical test data into numeric labels. This is how it's done:

 - Every category in the test data is replaced by the ***label*** stored for it in ***uniqCatAndCounts*** by ***categoryEncoder***.

 - If a category in the test data did not occur in the training data, it is assigned a new label, which is stored in ***uniqCatAndCounts***.

The continuous values of each row are then appended to its encoded categorical values, so that each row again resembles a row of the test data.

**Return values**:

 - **uniqCatAndCounts**: The dictionary of categories and their labels, updated with any categories first seen in the test data
 - **encodedCategoriesTest**: The encoded categorical values of each test row, followed by that row's continuous values
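
The handling of categories that appear only in the test data can be sketched on toy data. The function `encode_with_unseen`, the label map and the toy rows are hypothetical and exist only for this illustration; the continuous values are left out to keep it short.

```python
def encode_with_unseen(rows, labels):
    """Encode rows with an existing label map, extending it for unseen values."""
    next_label = len(labels) + 1
    encoded = []
    for row in rows:
        encoded_row = []
        for value in row:
            if value not in labels:
                # Unseen category: give it the next free label and remember it.
                labels[value] = next_label
                next_label += 1
            encoded_row.append(float(labels[value]))
        encoded.append(encoded_row)
    return labels, encoded

train_labels = {"A": 1, "B": 2, "C": 3}  # as if built from the training data
toy_test = [["A", "D"], ["D", "E"]]

train_labels, encoded_test = encode_with_unseen(toy_test, train_labels)
print(train_labels)  # {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5}
print(encoded_test)  # [[1.0, 4.0], [4.0, 5.0]]
```

Labels created this way were never present during training, so the model has learned nothing specific about them; they simply fall above every label it has seen.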

In [None]:
def testCatEncoder(categoriesTest, uniqCatAndCounts, continuousTest):
    #encodedCategoriesTest = categoriesTest
    numRows = len(categoriesTest)
    numCols = len(categoriesTest[0])
    label = len(uniqCatAndCounts) + 1
    encodedCategoriesTest = [[0 for x in range(numCols)] for y in range(numRows)] 
    for i in range(numRows):
        for j in range(numCols):
            if categoriesTest[i][j] in uniqCatAndCounts:
                encodedCategoriesTest[i][j] = float(uniqCatAndCounts[categoriesTest[i][j]])
            else:
                #Unseen category: assign the next free label and store it in uniqCatAndCounts
                uniqCatAndCounts[categoriesTest[i][j]] = label
                encodedCategoriesTest[i][j] = float(label)
                label += 1
        encodedCategoriesTest[i].extend(continuousTest[i])
    return uniqCatAndCounts, encodedCategoriesTest

uniqCatAndCounts, encodedCategoriesTest = testCatEncoder(categoriesTest, uniqCatAndCounts, continuousTest)
print("testCatEncoder done")

In [None]:
#Preparing model with required parameters
import numpy as np
from xgboost import XGBRegressor

seed = 0
n_estimators = 1000

best_model = XGBRegressor(n_estimators=n_estimators,seed=seed)

In [None]:
splitIdx = int(0.2*len(encodedCategories))

from sklearn.metrics import mean_absolute_error

#Train on the first 80 percent of the rows and hold out the last 20 percent for validation
encodedCategories_X_train = encodedCategories[:-splitIdx]
encodedCategories_X_test = encodedCategories[-splitIdx:]

#Applying log transformation to reduce the skew of the target values
logTransformedLoss = list(np.log(target))

# Split the targets into training/testing sets
target_y_train = logTransformedLoss[:-splitIdx]
target_y_test = logTransformedLoss[-splitIdx:]

fit2 = best_model.fit(encodedCategories_X_train, target_y_train)

#Mean absolute error on the original loss scale, the metric used in the current contest
mean_absolute_error(np.exp(target_y_test), np.exp(fit2.predict(encodedCategories_X_test)))
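
Because the model is trained on log-transformed targets, both predictions and held-out targets are mapped back with the exponential before the error is measured. The sketch below illustrates this with made-up numbers and only the standard library; the values and the constant error of 0.1 are assumptions chosen for illustration, not results from the model.

```python
import math

true_loss = [100.0, 1000.0, 10000.0]            # made-up loss values
log_target = [math.log(v) for v in true_loss]   # what the model is trained on

# Pretend every log-scale prediction is off by +0.1.
log_pred = [v + 0.1 for v in log_target]

# Undo the transformation before scoring, as the cell above does with np.exp.
pred_loss = [math.exp(v) for v in log_pred]
mae = sum(abs(p - t) for p, t in zip(pred_loss, true_loss)) / len(true_loss)

for t, p in zip(true_loss, pred_loss):
    print(t, round(p - t, 2))
print("MAE on the original scale:", round(mae, 2))
```

A constant error on the log scale becomes a constant percentage error on the original scale, so the largest losses dominate the mean absolute error.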

In [None]:
testPredictions = np.exp(fit2.predict(encodedCategoriesTest))

In [None]:
#Write the predicted values into csv file along with their ids
import csv
testIdsWithPredictions = zip(testIds, testPredictions)
with open('result.csv', 'w', newline='') as out:  #newline='' avoids blank rows on Windows (Python 3)
    csv_out=csv.writer(out)
    csv_out.writerow(['id','loss'])
    for row in testIdsWithPredictions:
        csv_out.writerow(row)