# ML Challenge | Land Classification

Multiple satellites capture data on the intensity of light reflected from the Earth at different frequencies, at a very granular geographic level. Some of this information can be used to classify land into four classes: built-up, barren, green, or water.

This is a **supervised multi-class classification** machine learning problem.

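**Metrics**: Our predictions will be assessed by the **Micro F1 Score**. For multi-class problems, the per-class F1 scores have to be aggregated. The macro F1 score averages the F1 score of each class without taking label imbalance into account, whereas the micro F1 score pools the true positives, false positives, and false negatives of all classes before computing a single F1 score.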
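
To make the difference between micro and macro averaging concrete, here is a minimal sketch in plain Python. The toy labels `y_true` and `y_pred` and the helper `f1_scores` are made up for illustration and are not part of the challenge data.

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # class p was predicted, but wrongly
            fn[t] += 1  # the true class t was missed

    classes = sorted(set(y_true) | set(y_pred))

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: average the per-class F1 scores, every class counts equally
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    # Micro: pool the counts over all classes, then compute one F1 score
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro

# Hypothetical toy labels in which class 0 dominates
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 2, 2]
micro_f1, macro_f1 = f1_scores(y_true, y_pred)
print(f"micro F1 = {micro_f1:.3f}, macro F1 = {macro_f1:.3f}")
```

When every sample carries exactly one label, the micro F1 score is the same as plain accuracy, so frequent classes weigh more heavily in it than in the macro average.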

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import gc # garbage collector
from scipy.stats import norm

# Visualization
import seaborn as sns
color = sns.color_palette()
sns.set(style="darkgrid")

import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
plt.rcParams['font.size'] = 18
plt.rcParams['patch.edgecolor'] = 'k'

%matplotlib inline

import plotly.offline as py
import plotly.graph_objs as go
import plotly.tools as tls
py.init_notebook_mode(connected=True)

import warnings
warnings.filterwarnings('ignore', category = RuntimeWarning)

# Always good to set a seed for reproducibility
SEED = 7
np.random.seed(SEED)

pd.options.mode.chained_assignment = None
pd.options.display.max_columns = 50

In [None]:
# check files (os.listdir works on any platform, unlike shelling out to `ls`)
import os
print("\n".join(sorted(os.listdir("../input/input/"))))


In [None]:
# Load Data
print("Loading data...")
train = pd.read_csv('../input/input/land_train.csv')
print("Train rows and columns", train.shape)

In [None]:
# [Important] Set random_state for reproducibility
# Work on a random sample of 1,000 rows to keep the exploration fast
train_df = train.sample(1000, random_state=SEED)

In [None]:
# delete the train dataset and free some memory
del train
gc.collect()

In [None]:
train_df.head()

That gives us a look at all of the columns, which don't appear to be in any particular order. To get a quick overview of the data, we use `train_df.info()`.

In [None]:
train_df.info()

This tells us there are 7 integer columns, 6 float (numeric) columns, and 0 object columns.

The column names are anonymized and so we do not know what they mean.

In [None]:
train_df['target'].value_counts() # check how many target values belong to each class

In [None]:
# Exploring Label Distribution
plt.figure(figsize=(12,8))
sns.countplot(x="target", data=train_df)  # pass the column by keyword; recent seaborn versions no longer accept positional data
plt.xlabel('Target', fontsize=12)
plt.title("Target Histogram", fontsize=14)
plt.show()
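
The imbalance visible in the plot can also be put into numbers. The sketch below uses a hypothetical `target` series in place of `train_df["target"]`; the class-weight formula is the common "balanced" heuristic, shown here only as one possible way to compensate later.

```python
import pandas as pd

# Hypothetical labels standing in for train_df["target"]
target = pd.Series([0] * 6 + [1] * 3 + [2] * 1, name="target")

counts = target.value_counts().sort_index()
shares = counts / counts.sum()                 # fraction of rows per class
imbalance_ratio = counts.max() / counts.min()  # largest class vs. smallest class

# "Balanced" heuristic: n_samples / (n_classes * class_count)
class_weights = len(target) / (len(counts) * counts)

print(shares)
print("imbalance ratio:", imbalance_ratio)
print(class_weights)
```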

In classification tasks, it is generally better to use stratified sampling when dividing the dataset into train and test sets, so that each split keeps the class proportions of the full dataset.
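
A minimal sketch of such a split using only pandas is shown here. The frame `df`, the 20% test fraction, and the names `train_part` and `test_part` are assumptions for illustration; in practice the same idea would be applied to the real data.

```python
import pandas as pd

# Hypothetical imbalanced data standing in for the real training frame
df = pd.DataFrame({
    "X1": range(100),
    "target": [0] * 80 + [1] * 20,
})

# Draw 20% of the rows within each class for the test set
test_part = df.groupby("target").sample(frac=0.2, random_state=7)
# Everything that was not drawn stays in the training set
train_part = df.drop(test_part.index)

print(train_part["target"].value_counts(normalize=True).sort_index())
print(test_part["target"].value_counts(normalize=True).sort_index())
```

Both splits keep the original 80/20 class ratio. scikit-learn's `train_test_split` offers the same behaviour through its `stratify` argument.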

### Missing Values

Now let us check if there are missing values in the dataset.

In [None]:
missing_df = train_df.isnull().sum(axis=0).reset_index()
missing_df.columns = ['column_name', 'missing_count']
missing_df = missing_df[missing_df['missing_count']>0]
missing_df = missing_df.sort_values(by='missing_count')
missing_df

There are no missing values in the dataset :)

### Numeric Features Description

In [None]:
train_df.describe()

As we can see, the standard deviation is very high, which means the data is widely spread out.

## Univariate Analysis

In [None]:
train_df['X1'].plot(kind="hist")

In [None]:
sns.distplot(train_df['X1'], fit = norm)

Note: The feature has a bimodal distribution, so we either have to transform the variable or use algorithms that do not assume normally distributed features, such as SVM or random forest.
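
One candidate transformation is Box-Cox, sketched below on synthetic right-skewed data (not the challenge data). Two caveats apply: Box-Cox requires strictly positive inputs, and a power transform mainly reduces skew, so it will not by itself merge two modes into one.

```python
import numpy as np
from scipy.stats import boxcox, skew

# Synthetic, strictly positive, right-skewed values (an assumption for illustration)
rng = np.random.default_rng(7)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# With no lambda given, boxcox fits the power parameter by maximum likelihood
x_bc, lam = boxcox(x)

print(f"fitted lambda = {lam:.3f}")
print(f"skew before = {skew(x):.3f}, skew after = {skew(x_bc):.3f}")
```

Whether the transformed feature is close enough to normal should be checked again with a histogram before relying on it.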

In [None]:
# Categorical plot for X1
sns.catplot(x="target", y="X1", kind="swarm", data=train_df);

In [None]:
# Categorical Plots of X2
sns.catplot(x="target", y="X2", kind="swarm", data=train_df);

**Conclusion**: Some values are really high, and since we don't know what the variables represent, we cannot be sure whether such values are possible. For model building, we would remove the outliers.
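
The notebook does not say which rule would be used to remove outliers. As one possible approach, the sketch below applies the common 1.5 × IQR rule; the helper `remove_iqr_outliers`, the multiplier `k`, and the toy frame are all assumptions for illustration.

```python
import pandas as pd

def remove_iqr_outliers(df, columns, k=1.5):
    """Keep rows whose values lie within [Q1 - k*IQR, Q3 + k*IQR] for every listed column."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]

# Hypothetical toy frame; the value 100 lies far outside the bulk of X1
toy = pd.DataFrame({"X1": [1, 2, 3, 4, 5, 100], "target": [1, 1, 2, 2, 3, 3]})
cleaned = remove_iqr_outliers(toy, ["X1"])
print(cleaned)
```

Because the meaning of the variables is unknown, it is worth checking how many rows per class such a filter removes before applying it to the training data.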

## Multivariable Analysis

### Plotting pairwise data relationships

In [None]:
sns.pairplot(train_df[['X1','X2','X3','X4','X5','X6','target']], hue="target", palette="Set2", diag_kind="kde",height = 3)

Interesting: as we can see, there are a lot of outliers in our data. We will investigate them further.

In [None]:
sns.pairplot(train_df[['I1', 'I2', 'I3', 'I4', 'I5', 'I6',
       'target']], hue="target", palette="Set2", diag_kind="kde",height = 3)

### Plotting Correlation Matrix

In [None]:
#correlation heatmap of dataset
def correlation_heatmap(df):
    _ , ax = plt.subplots(figsize =(14, 12))
    colormap = sns.diverging_palette(220, 10, as_cmap = True)
    
    _ = sns.heatmap(
        df.corr(), 
        cmap = colormap,
        square=True, 
        cbar_kws={'shrink':.9 }, 
        ax=ax,
        annot=True, 
        linewidths=0.1,vmax=1.0, linecolor='white',
        annot_kws={'fontsize':12 }
    )
    
    plt.title('Pearson Correlation of Features', y=1.05, size=15)

correlation_heatmap(train_df)

Although the number of variables is small, many of the features are strongly correlated. We have to remove the redundant ones.
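
One way to pick the redundant features is to scan the upper triangle of the absolute correlation matrix and drop the later column of every highly correlated pair. This is a sketch under assumptions: the helper `find_correlated_features`, the 0.9 threshold, and the synthetic columns `a`, `b`, `c` are illustrative, and the `target` column should be excluded before applying it to the real data.

```python
import numpy as np
import pandas as pd

def find_correlated_features(df, threshold=0.9):
    """Return columns whose absolute Pearson correlation with an earlier column exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is checked once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# Hypothetical features: b is almost a linear copy of a, c is independent noise
rng = np.random.default_rng(7)
a = rng.normal(size=200)
features = pd.DataFrame({
    "a": a,
    "b": 2 * a + 0.01 * rng.normal(size=200),
    "c": rng.normal(size=200),
})

to_drop = find_correlated_features(features, threshold=0.9)
reduced = features.drop(columns=to_drop)
print("dropping:", to_drop)
print("remaining:", list(reduced.columns))
```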

## Outliers investigation

In [None]:
def OutLiersBox(df,nameOfFeature):
    
    trace0 = go.Box(
        y = df[nameOfFeature],
        name = "All Points",
        jitter = 0.3,
        pointpos = -1.8,
        boxpoints = 'all',
        marker = dict(
            color = 'rgb(7,40,89)'),
        line = dict(
            color = 'rgb(7,40,89)')
    )

    trace1 = go.Box(
        y = df[nameOfFeature],
        name = "Only Whiskers",
        boxpoints = False,
        marker = dict(
            color = 'rgb(9,56,125)'),
        line = dict(
            color = 'rgb(9,56,125)')
    )

    trace2 = go.Box(
        y = df[nameOfFeature],
        name = "Suspected Outliers",
        boxpoints = 'suspectedoutliers',
        marker = dict(
            color = 'rgb(8,81,156)',
            outliercolor = 'rgba(219, 64, 82, 0.6)',
            line = dict(
                outliercolor = 'rgba(219, 64, 82, 0.6)',
                outlierwidth = 2)),
        line = dict(
            color = 'rgb(8,81,156)')
    )

    trace3 = go.Box(
        y = df[nameOfFeature],
        name = "Whiskers and Outliers",
        boxpoints = 'outliers',
        marker = dict(
            color = 'rgb(107,174,214)'),
        line = dict(
            color = 'rgb(107,174,214)')
    )

    data = [trace0,trace1,trace2,trace3]

    layout = go.Layout(
        title = "{} Outliers".format(nameOfFeature)
    )

    fig = go.Figure(data=data,layout=layout)
    py.iplot(fig, filename = "Outliers")

In [None]:
OutLiersBox(train_df,'X1') # Outliers X1

In [None]:
OutLiersBox(train_df,'X2') # Outliers X2

In [None]:
OutLiersBox(train_df,'X3') 

In [None]:
OutLiersBox(train_df,'X4')

In [None]:
OutLiersBox(train_df,'X5')

In [None]:
OutLiersBox(train_df,'X6')

In [None]:
OutLiersBox(train_df,'I1')

In [None]:
OutLiersBox(train_df,'I2')

In [None]:
OutLiersBox(train_df,'I3')

In [None]:
OutLiersBox(train_df,'I4')

In [None]:
OutLiersBox(train_df,'I5')

In [None]:
OutLiersBox(train_df,'I6')

## Conclusion

So, from the initial analysis we find the following:

* X1 to X6 have outliers.
* There is no missing data.
* Numeric features have a bimodal distribution (either transform them toward normality, e.g. with Box-Cox, or use algorithms that do not depend on normality, such as SVM or random forest).
* Correlated features were detected.
* The dataset is not perfectly balanced.


