# Assignment 4 - Data Set Description
The questions below relate to the data files associated with the contest with the title 'DengAI: Predicting Disease Spread' published at the following website. 
https://www.drivendata.org/competitions/44/dengai-predicting-disease-spread/data/

Anyone can join the contest and showcase your skills. To know about contest submissions visit the following webpage
https://www.drivendata.org/competitions/44/dengai-predicting-disease-spread/submissions/
You can showcase your Machine Learning skills by ranking top in the contest. 

Problem description:
Your goal is to predict the total_cases label for each (city, year, weekofyear) in the test set. There are two cities, San Juan and Iquitos, with test data for each city spanning 5 and 3 years respectively. You will make one submission that contains predictions for both cities. The data for each city have been concatenated along with a city column indicating the source: sj for San Juan and iq for Iquitos. The test set is a pure future hold-out, meaning the test data are sequential and non-overlapping with any of the training data. Throughout, missing values have been filled as NaNs.

Assignment:
The goal is achieved through three subsequent Assignments 1, 2, 3 and 4, all using the same dataset


The features in this dataset
You are provided the following set of information on a (year, weekofyear) timescale:

(Where appropriate, units are provided as a _unit suffix on the feature name.)

City and date indicators

    city – City abbreviations: sj for San Juan and iq for Iquitos
    week_start_date – Date given in yyyy-mm-dd format

NOAA's GHCN daily climate data weather station measurements

    station_max_temp_c – Maximum temperature
    station_min_temp_c – Minimum temperature
    station_avg_temp_c – Average temperature
    station_precip_mm – Total precipitation
    station_diur_temp_rng_c – Diurnal temperature range
    
PERSIANN satellite precipitation measurements (0.25x0.25 degree scale)

    precipitation_amt_mm – Total precipitation

NOAA's NCEP Climate Forecast System Reanalysis measurements (0.5x0.5 degree scale)

    reanalysis_sat_precip_amt_mm – Total precipitation
    reanalysis_dew_point_temp_k – Mean dew point temperature
    reanalysis_air_temp_k – Mean air temperature
    reanalysis_relative_humidity_percent – Mean relative humidity
    reanalysis_specific_humidity_g_per_kg – Mean specific humidity
    reanalysis_precip_amt_kg_per_m2 – Total precipitation
    reanalysis_max_air_temp_k – Maximum air temperature
    reanalysis_min_air_temp_k – Minimum air temperature
    reanalysis_avg_temp_k – Average air temperature
    reanalysis_tdtr_k – Diurnal temperature range

Satellite vegetation - Normalized difference vegetation index (NDVI) - NOAA's CDR Normalized Difference Vegetation Index (0.5x0.5 degree scale) measurements

    ndvi_se – Pixel southeast of city centroid
    ndvi_sw – Pixel southwest of city centroid
    ndvi_ne – Pixel northeast of city centroid
    ndvi_nw – Pixel northwest of city centroid

# Assignment 4 - Questions
Use the merged data frame from Assignment 1,2  and 3 for this assignment

This Assignment focuses on data preprocessing and model building. Continue with the datasets loaded in Assignment 1, 2 and 3 (or reload with same steps and create merged data frame). In this assignment you need to use Neural Network

1. Load the data (both features and label data set as before)
2. Preprocess the data - briefly comment if any special preprocessing is adopted to suit Neural Network
3. Optional: Build a Neural Network Multi-Layer Perceptron Regressor model (you can use sklearn neural network MLP Regressor)
4. Optional: Evaluate the model and compare it with the previous three assignments
5. Add a new column called 'above_average' with value 1 or 0. 1 if the total_cases > median of total_cases
6. Build a Neural Network MLP Classifier on the 'above_average' column with 80/20 train/test split
7. Explain the meaning of Precision, Recall and F1-Score and why these are used to evaluate Classification models (instead of using Accuracy as a metric). Evaluate the classifier using Precision, Recall and F1 score values

Submit the .ipynb, and .html (optional submission.csv if you performed MLP regressor)


In [1]:
#common
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns
from matplotlib.pyplot import figure
%matplotlib inline

#sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor, MLPClassifier
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import classification_report

from sklearn.metrics import precision_score, recall_score, f1_score


# Ignore useless warnings
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")





Bad key "text.kerning_factor" on line 4 in
/anaconda3/lib/python3.7/site-packages/matplotlib/mpl-data/stylelib/_classic_test_patch.mplstyle.
You probably need to get an updated matplotlibrc file from
https://github.com/matplotlib/matplotlib/blob/v3.1.3/matplotlibrc.template
or from the matplotlib source distribution


In [2]:
df = pd.read_csv('dengue_features_train.csv')

In [3]:
#Changing the type to categorical value
df.year = df.year.astype('category')
df.city = df.city.astype('category')

In [4]:
#Abbreviating the column names  
d = {'station': 'stn', 'reanalysis': 're_an','humidity': 'hd','precipitation':'prec'}

def replace_all(text, dic):
    for i, j in dic.items():
        text.columns = text.columns.str.replace(i, j)
    return text

replace_all(df,d)

Unnamed: 0,city,year,weekofyear,week_start_date,ndvi_ne,ndvi_nw,ndvi_se,ndvi_sw,prec_amt_mm,re_an_air_temp_k,...,re_an_precip_amt_kg_per_m2,re_an_relative_hd_percent,re_an_sat_precip_amt_mm,re_an_specific_hd_g_per_kg,re_an_tdtr_k,stn_avg_temp_c,stn_diur_temp_rng_c,stn_max_temp_c,stn_min_temp_c,stn_precip_mm
0,sj,1990,18,1990-04-30,0.122600,0.103725,0.198483,0.177617,12.42,297.572857,...,32.00,73.365714,12.42,14.012857,2.628571,25.442857,6.900000,29.4,20.0,16.0
1,sj,1990,19,1990-05-07,0.169900,0.142175,0.162357,0.155486,22.82,298.211429,...,17.94,77.368571,22.82,15.372857,2.371429,26.714286,6.371429,31.7,22.2,8.6
2,sj,1990,20,1990-05-14,0.032250,0.172967,0.157200,0.170843,34.54,298.781429,...,26.10,82.052857,34.54,16.848571,2.300000,26.714286,6.485714,32.2,22.8,41.4
3,sj,1990,21,1990-05-21,0.128633,0.245067,0.227557,0.235886,15.36,298.987143,...,13.90,80.337143,15.36,16.672857,2.428571,27.471429,6.771429,33.3,23.3,4.0
4,sj,1990,22,1990-05-28,0.196200,0.262200,0.251200,0.247340,7.52,299.518571,...,12.20,80.460000,7.52,17.210000,3.014286,28.942857,9.371429,35.0,23.9,5.8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1451,iq,2010,21,2010-05-28,0.342750,0.318900,0.256343,0.292514,55.30,299.334286,...,45.00,88.765714,55.30,18.485714,9.800000,28.633333,11.933333,35.4,22.4,27.0
1452,iq,2010,22,2010-06-04,0.160157,0.160371,0.136043,0.225657,86.47,298.330000,...,207.10,91.600000,86.47,18.070000,7.471429,27.433333,10.500000,34.7,21.7,36.6
1453,iq,2010,23,2010-06-11,0.247057,0.146057,0.250357,0.233714,58.94,296.598571,...,50.60,94.280000,58.94,17.008571,7.500000,24.400000,6.900000,32.2,19.2,7.4
1454,iq,2010,24,2010-06-18,0.333914,0.245771,0.278886,0.325486,59.67,296.345714,...,62.33,94.660000,59.67,16.815714,7.871429,25.433333,8.733333,31.2,21.0,16.0


In [5]:
df.columns

Index(['city', 'year', 'weekofyear', 'week_start_date', 'ndvi_ne', 'ndvi_nw',
       'ndvi_se', 'ndvi_sw', 'prec_amt_mm', 're_an_air_temp_k',
       're_an_avg_temp_k', 're_an_dew_point_temp_k', 're_an_max_air_temp_k',
       're_an_min_air_temp_k', 're_an_precip_amt_kg_per_m2',
       're_an_relative_hd_percent', 're_an_sat_precip_amt_mm',
       're_an_specific_hd_g_per_kg', 're_an_tdtr_k', 'stn_avg_temp_c',
       'stn_diur_temp_rng_c', 'stn_max_temp_c', 'stn_min_temp_c',
       'stn_precip_mm'],
      dtype='object')

In [6]:
lb = pd.read_csv('dengue_labels_train.csv')

In [7]:
lb.year = lb.year.astype('category')

In [42]:
#Merge data
df_merged = df.merge(lb,on=['city','year','weekofyear'],how='inner')

In [43]:
#Handling of NaNs
df_merged.fillna(method='ffill',inplace=True) #add comment not using mean coz spread is long

In [44]:
df_merged["total_cases"] = df_merged["total_cases"].astype('float64')

In [45]:
#Features and labels
X_draft = df_merged.drop(['week_start_date',
                    're_an_sat_precip_amt_mm',              
                    're_an_specific_hd_g_per_kg',           
                    're_an_precip_amt_kg_per_m2',                                
                    're_an_max_air_temp_k',                 
                    're_an_min_air_temp_k',                  
                    'total_cases'],
                     axis=1)
y = df_merged['total_cases']

In [46]:
cat_feature = ["city", 'year']
#convert the attributes to categorical form
for i in cat_feature:
    X_draft[i] = X_draft[i].astype("category")
    X_draft[i] = X_draft[i].astype("category")

#Convert categorical variable into dummy/indicator variables
X_conv = pd.get_dummies(X_draft[['city','year']])

In [47]:
X_conv

Unnamed: 0,city_iq,city_sj,year_1990,year_1991,year_1992,year_1993,year_1994,year_1995,year_1996,year_1997,...,year_2001,year_2002,year_2003,year_2004,year_2005,year_2006,year_2007,year_2008,year_2009,year_2010
0,0,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1451,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1452,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1453,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1454,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [48]:
scaler = StandardScaler()
#Applying the StandardScaler to all the numerical data
X_scaled = scaler.fit_transform(X_draft.drop(columns=['city', 'year'],axis=1))
X = np.concatenate((X_scaled,np.array(X_conv)),axis=1)

In [49]:
X.shape

(1456, 39)

In [50]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

In [51]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((1164, 39), (292, 39), (1164,), (292,))

In [52]:
type(X_train)

numpy.ndarray

In [53]:
regressor = MLPRegressor(hidden_layer_sizes = (100, 75, 50, 25,), activation = 'relu', solver = 'sgd', 
                         learning_rate = 'adaptive',random_state = 42)


In [54]:
regressor.fit(X_train,y_train)

MLPRegressor(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
             beta_2=0.999, early_stopping=False, epsilon=1e-08,
             hidden_layer_sizes=(100, 75, 50, 25), learning_rate='adaptive',
             learning_rate_init=0.001, max_iter=200, momentum=0.9,
             n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
             random_state=42, shuffle=True, solver='sgd', tol=0.0001,
             validation_fraction=0.1, verbose=False, warm_start=False)

In [55]:
print(regressor.n_outputs_)

1


In [56]:
predictions_train = regressor.predict(X_train)


In [57]:
predictions_train

array([18.7237002, 18.7237002, 18.7237002, ..., 18.7237002, 18.7237002,
       18.7237002])

In [58]:
error_train = mean_absolute_error(predictions_train, y_train.values.ravel())
error_train

21.15932324770071

# Part 2

In [24]:
# Create a new column called above_average where the value is 1 or 0. 1 if the total_cases > median of total_cases
# if df.above_average is greater than median_of_total_cases
df_merged['above_average'] = np.where(df_merged['total_cases']>=df_merged['total_cases'].median(), 1, 0,)

In [25]:
#Features and labels
X_draft_new = df_merged.drop(['week_start_date',
                    're_an_sat_precip_amt_mm',              
                    're_an_specific_hd_g_per_kg',           
                    're_an_precip_amt_kg_per_m2',                                
                    're_an_max_air_temp_k',                 
                    're_an_min_air_temp_k',                  
                    'total_cases','above_average'],axis=1)
y_new = df_merged['above_average']

In [26]:
X_draft_new.head()

Unnamed: 0,city,year,weekofyear,ndvi_ne,ndvi_nw,ndvi_se,ndvi_sw,prec_amt_mm,re_an_air_temp_k,re_an_avg_temp_k,re_an_dew_point_temp_k,re_an_relative_hd_percent,re_an_tdtr_k,stn_avg_temp_c,stn_diur_temp_rng_c,stn_max_temp_c,stn_min_temp_c,stn_precip_mm
0,sj,1990,18,0.1226,0.103725,0.198483,0.177617,12.42,297.572857,297.742857,292.414286,73.365714,2.628571,25.442857,6.9,29.4,20.0,16.0
1,sj,1990,19,0.1699,0.142175,0.162357,0.155486,22.82,298.211429,298.442857,293.951429,77.368571,2.371429,26.714286,6.371429,31.7,22.2,8.6
2,sj,1990,20,0.03225,0.172967,0.1572,0.170843,34.54,298.781429,298.878571,295.434286,82.052857,2.3,26.714286,6.485714,32.2,22.8,41.4
3,sj,1990,21,0.128633,0.245067,0.227557,0.235886,15.36,298.987143,299.228571,295.31,80.337143,2.428571,27.471429,6.771429,33.3,23.3,4.0
4,sj,1990,22,0.1962,0.2622,0.2512,0.24734,7.52,299.518571,299.664286,295.821429,80.46,3.014286,28.942857,9.371429,35.0,23.9,5.8


In [27]:
X_draft_new.isna().sum()

city                         0
year                         0
weekofyear                   0
ndvi_ne                      0
ndvi_nw                      0
ndvi_se                      0
ndvi_sw                      0
prec_amt_mm                  0
re_an_air_temp_k             0
re_an_avg_temp_k             0
re_an_dew_point_temp_k       0
re_an_relative_hd_percent    0
re_an_tdtr_k                 0
stn_avg_temp_c               0
stn_diur_temp_rng_c          0
stn_max_temp_c               0
stn_min_temp_c               0
stn_precip_mm                0
dtype: int64

In [28]:
cat_feature_new = ["city", 'year']
#convert the attributes to categorical form
for i in cat_feature:
    X_draft_new[i] = X_draft_new[i].astype("category")
    X_draft_new[i] = X_draft_new[i].astype("category")

#Convert categorical variable into dummy/indicator variables
X_conv_new = pd.get_dummies(X_draft_new[['city','year']])

In [29]:
X_scaled_new = scaler.fit_transform(X_draft_new.drop(columns=['city', 'year'],axis=1))
X_new = np.concatenate((X_scaled_new,np.array(X_conv_new)),axis=1)

In [30]:
np.nan_to_num(X_new)

array([[-0.56635649, -0.06261869, -0.20367047, ...,  0.        ,
         0.        ,  0.        ],
       [-0.49975322,  0.27894877,  0.11803367, ...,  0.        ,
         0.        ,  0.        ],
       [-0.43314995, -0.71506308,  0.37566221, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [-0.23334015,  0.83612328,  0.15051449, ...,  0.        ,
         0.        ,  1.        ],
       [-0.16673689,  1.46334512,  0.98480586, ...,  0.        ,
         0.        ,  1.        ],
       [-0.10013362,  1.20533821,  0.87771059, ...,  0.        ,
         0.        ,  1.        ]])

In [31]:
X_train_new, X_test_new, y_train_new, y_test_new = train_test_split(X_new, y_new, test_size=0.20)

In [32]:
X_train_new.shape, X_test_new.shape, y_train_new.shape, y_test_new.shape

((1164, 39), (292, 39), (1164,), (292,))

In [33]:
type(X_train_new)

numpy.ndarray

In [34]:
regressor_new = MLPClassifier(hidden_layer_sizes = (100, 75, 50,), activation = 'relu', solver = 'sgd', 
                         learning_rate = 'adaptive',random_state = 42)



In [35]:
regressor_new.fit(X_train_new,y_train_new)



MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(100, 75, 50), learning_rate='adaptive',
              learning_rate_init=0.001, max_iter=200, momentum=0.9,
              n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
              random_state=42, shuffle=True, solver='sgd', tol=0.0001,
              validation_fraction=0.1, verbose=False, warm_start=False)

In [36]:
print(regressor.n_outputs_)

1


In [37]:
predictions_train_new = regressor_new.predict(X_train_new)
print(predictions_train_new)
error_train_new = mean_absolute_error(predictions_train_new, y_train_new.values.ravel())
error_train_new

[1 0 1 ... 0 1 0]


0.22508591065292097

In [38]:
predictions_test_new = regressor_new.predict(X_test_new)
error_test_new = mean_absolute_error(predictions_test_new, y_test_new.values.ravel())
error_test_new

0.2465753424657534

In [60]:
#Precision and Recall
print('precision_score: %.3f' % precision_score(y_test_new, predictions_test_new))
print('Precision = TruePositives / (TruePositives + FalsePositives) \n ')
print('recall_score: %.3f' % recall_score(y_test_new, predictions_test_new))
print('Recall = TruePositives / (TruePositives + FalseNegatives)  \n ')
print('f1_score: %.3f' % f1_score(y_test_new, predictions_test_new))
print('F-Measure = (2 * Precision * Recall) / (Precision + Recall) \n')

precision_score: 0.719
Precision = TruePositives / (TruePositives + FalsePositives) 
 
recall_score: 0.791
Recall = TruePositives / (TruePositives + FalseNegatives)  
 
f1_score: 0.753
F-Measure = (2 * Precision * Recall) / (Precision + Recall) 

