This is my attempt at some exploratory data analysis on the Forest cover dataset. Also my first public kernel!

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
filename = '../input/train.csv'

In [None]:
# Put the data into a dataframe
df = pd.read_csv(filename)

**Let's have an overview of the data**

In [None]:
df.describe()

In [None]:
df.info()

*No missing values!*
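
`df.info()` shows the non-null counts per column, but the check can also be made explicit. Here is a minimal sketch on a toy frame; the toy values and the helper name `count_missing` are made up for illustration and are not part of the dataset.

```python
import numpy as np
import pandas as pd

def count_missing(frame):
    """Return the number of missing values per column, keeping only columns that have any."""
    missing = frame.isnull().sum()
    return missing[missing > 0]

# Toy data standing in for the real training set (hypothetical values)
toy = pd.DataFrame({
    "Elevation": [2596, 2590, np.nan],
    "Slope": [3, 2, 9],
})

missing_counts = count_missing(toy)
print(missing_counts)
```

On the real data, an empty result from `count_missing(df)` would confirm what `df.info()` suggests.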

In [None]:
# How many samples of each cover type are there?
df["Cover_Type"].value_counts().plot(kind='bar',color='gold')
plt.ylabel("Number of Occurrences")
plt.xlabel("Cover Type")

*All cover types have equal representation!*
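
The bar chart shows the balance visually; the same thing can be checked with numbers. This is only a sketch on toy labels, and `class_balance` is a hypothetical helper name.

```python
import pandas as pd

def class_balance(labels):
    """Return the share of samples in each class, sorted by class label."""
    return labels.value_counts(normalize=True).sort_index()

# Toy labels standing in for df["Cover_Type"] (hypothetical values)
toy_labels = pd.Series([1, 1, 2, 2, 3, 3], name="Cover_Type")

shares = class_balance(toy_labels)
# The classes are balanced when the largest and smallest shares coincide
is_balanced = shares.max() - shares.min() < 1e-9
print(shares)
print("Balanced:", is_balanced)
```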

In [None]:
# Extract column names from the dataset
col_names = df.columns.tolist()

Let's look at some properties of the continuous variables. We will check out the categorical variables later.

In [None]:
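for name in col_names:
    # Skip the one-hot Soil_Type/Wilderness_Area columns, the Id and the target
    if not name.startswith(('Soil', 'Wild')) and name not in ('Id', 'Cover_Type'):
        plt.figure()
        # histplot (seaborn >= 0.11) replaces the deprecated distplot
        sns.histplot(df[name], kde=True, stat="density")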

*Since some variables are skewed to the left or right, normalization may be useful going forward.*
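
Skewness can also be measured rather than judged by eye. The sketch below ranks columns by sample skewness and tries `np.log1p` on a right-skewed toy column. The toy data and the helper name `skew_report` are assumptions for illustration, and `log1p` is just one possible transform: it is only defined for values greater than -1, so it would not suit a column with negative values.

```python
import numpy as np
import pandas as pd

def skew_report(frame, columns):
    """Return sample skewness per column, sorted by absolute skewness, largest first."""
    skews = frame[columns].skew()
    order = skews.abs().sort_values(ascending=False).index
    return skews.reindex(order)

# Toy data (hypothetical values): one long right tail, one symmetric column
toy = pd.DataFrame({
    "right_skewed": [0, 1, 1, 2, 2, 3, 50],
    "symmetric": [1, 2, 3, 4, 5, 6, 7],
})

report = skew_report(toy, ["right_skewed", "symmetric"])
# log1p compresses the long right tail
logged = np.log1p(toy["right_skewed"])
print(report)
print("Skew after log1p:", logged.skew())
```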

In [None]:
for name in col_names:
    if not name.startswith(('Soil', 'Wild')) and name not in ('Id', 'Cover_Type'):
        title = name + ' vs Cover Type'
        plt.figure()
        # Pass x and y by keyword; newer seaborn versions reject positional data arguments
        sns.stripplot(x=df["Cover_Type"], y=df[name], jitter=True)
        plt.title(title)

*Slope and aspect look nearly identical across cover types, so they offer almost no discrimination value.*

Let's see the correlation between the variables now.

In [None]:
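# Keep only the non-indicator columns (no Soil_Type or Wilderness_Area);
# cont_cols avoids shadowing the built-in vars()
cont_cols = [x for x in df.columns.tolist() if "Soil_Type" not in x and "Wilderness" not in x]
df1 = df.reindex(columns=cont_cols)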

In [None]:
corrmat = df1.corr()
f, ax = plt.subplots(figsize=(12, 7))
sns.heatmap(corrmat, vmax=.5, square=True);

<p><em>The vertical and horizontal distances to hydrology are strongly correlated.</em><br />
<em>Hillshade_3pm is highly positively correlated with Hillshade_noon, and highly negatively correlated with Hillshade_9am. It is also positively correlated with aspect.</em><br />
<em>So we will drop Horizontal_Distance_To_Hydrology and Hillshade_3pm from our analysis.</em></p>
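
The pairs read off the heatmap can also be listed programmatically. This is a rough sketch on toy data: the helper name `correlated_pairs` and the threshold values are my own assumptions, not something the dataset prescribes.

```python
import pandas as pd

def correlated_pairs(frame, threshold=0.5):
    """Return (col_a, col_b, corr) for each column pair whose absolute
    Pearson correlation exceeds threshold, strongest first."""
    corr = frame.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        # Only look above the diagonal so each pair is reported once
        for j in range(i + 1, len(cols)):
            value = corr.iloc[i, j]
            if abs(value) > threshold:
                pairs.append((cols[i], cols[j], float(value)))
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],  # almost a multiple of a
    "c": [5, 1, 4, 2, 3],   # only weakly related to a and b
})

pairs = correlated_pairs(toy, threshold=0.9)
print(pairs)
```

On the real data this would be called on `df1`, with the threshold chosen to taste.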


In [None]:
drop_cols = ['Horizontal_Distance_To_Hydrology', 'Hillshade_3pm']
df1 = df1.drop(drop_cols, axis=1)

Let's see if pairs of variables can give us some discrimination between the cover types. For this analysis, I am excluding Slope and Aspect, since we saw earlier that they look nearly identical across cover types.

In [None]:
# So which variables are we plotting?
# (plot_vars avoids shadowing the built-in vars())
remove_cols = ['Id', 'Slope', 'Aspect', 'Cover_Type']
plot_vars = [x for x in df1.columns.tolist() if x not in remove_cols]
plot_vars

In [None]:
g = sns.pairplot(df, vars=plot_vars, hue="Cover_Type")

*I like the ellipse drawn between the two hillshade variables!*

Let's check out the Wilderness variables now.

In [None]:
col_names_wilderness = [x for x in df.columns.tolist() if "Wilderness" in x]

In [None]:
types_sum = df[col_names_wilderness].groupby(df['Cover_Type']).sum()

In [None]:
ax = types_sum.T.plot(kind='bar', figsize=(13, 7), legend=True, fontsize=12)
ax.set_xlabel("Wilderness_Type", fontsize=12)
ax.set_ylabel("Count", fontsize=12)
plt.show()

<p><em>Wilderness_Area2 has very few samples. All of Cover_Type 4 is in Wilderness_Area4, which is excellent for telling it apart.</em><br />
<em>Distinguishing Cover_Type 1 and 2 seems to be very difficult from here.</em></p>
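
The same counts can be read as a table. The sketch below collapses the one-hot wilderness columns into a single label and cross-tabulates it against the cover type. It assumes exactly one indicator is set per row; the toy values and the helper name `collapse_one_hot` are made up for illustration.

```python
import pandas as pd

def collapse_one_hot(frame, columns):
    """Return, for each row, the name of the indicator column set to 1.
    Assumes exactly one of the given columns is 1 in every row."""
    return frame[columns].idxmax(axis=1)

# Toy data standing in for the real frame (hypothetical values)
toy = pd.DataFrame({
    "Wilderness_Area1": [1, 0, 0, 1],
    "Wilderness_Area2": [0, 1, 0, 0],
    "Wilderness_Area3": [0, 0, 1, 0],
    "Cover_Type": [1, 2, 2, 1],
})

area_cols = [c for c in toy.columns if "Wilderness" in c]
area = collapse_one_hot(toy, area_cols).rename("Wilderness_Area")
# Rows are wilderness areas, columns are cover types, cells are sample counts
table = pd.crosstab(area, toy["Cover_Type"])
print(table)
```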

Moving on to the soil types.

In [None]:
# How many of each Soil_Type are there?
soil_types = [item for item in col_names if "Soil" in item]
for soil_type in soil_types:
    print(soil_type, df[soil_type].sum())

*It seems some soil types are not present at all!*

In [None]:
# Which soil_types support which cover_types?
types_sum = df[soil_types].groupby(df['Cover_Type']).sum()
types_sum.T.plot(kind='bar', stacked=True, figsize=(13,8), cmap='jet')

<p><em>Cover_Type7 seems to be present mostly in soil types 35-40.<br />
Cover_Type6 is found mostly in soil_type10.<br />
Again, it is difficult to distinguish between Cover_Type1 and 2.</em></p>
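
The stacked bars show raw counts, so rare soil types are hard to read. One option is to convert each soil type's counts into shares per cover type. This is only a sketch on toy data, and `cover_share_by_soil` is a hypothetical helper name.

```python
import pandas as pd

def cover_share_by_soil(frame, soil_columns, target="Cover_Type"):
    """For each soil column, return the share of its samples in each cover type."""
    counts = frame[soil_columns].groupby(frame[target]).sum().T
    totals = counts.sum(axis=1)
    # Soil types with no samples would divide by zero, so leave them out
    counts = counts[totals > 0]
    return counts.div(counts.sum(axis=1), axis=0)

# Toy data (hypothetical values); Soil_Type3 never occurs
toy = pd.DataFrame({
    "Soil_Type1": [1, 1, 0, 0],
    "Soil_Type2": [0, 0, 1, 1],
    "Soil_Type3": [0, 0, 0, 0],
    "Cover_Type": [1, 2, 2, 2],
})

shares = cover_share_by_soil(toy, ["Soil_Type1", "Soil_Type2", "Soil_Type3"])
print(shares)
```

Each row of the result sums to 1, so it could be drawn with the same stacked bar plot as above.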

In [None]:
# Let's look at it another way: one row per (Cover_Type, Soil_Type) pair.
arr = []

for i in range(1, 8):
    # Sum the soil indicator columns once per cover type
    soil_sums = df.loc[df['Cover_Type'] == i, soil_types].sum()
    for j in range(1, 41):
        arr.append([i, j, soil_sums['Soil_Type' + str(j)]])

# Note: this reuses the name df1 from the correlation analysis
labels = ['Cover_Type', 'Soil_Type', 'Sum']
df1 = pd.DataFrame.from_records(arr, columns=labels)

In [None]:
plt.figure(figsize=(15,5))
distt = df1.pivot(index="Cover_Type", columns="Soil_Type", values="Sum")
ax = sns.heatmap(distt)

<p><em>I read it left-to-right. This gives the same information as the stacked plot earlier, but from the point of view of cover_types.</em></p>
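
The matrix behind this heatmap can also be built without the nested loop, straight from a `groupby`. A minimal sketch on toy data follows; `soil_count_matrix` is a hypothetical helper name, and it assumes the soil columns are named `Soil_Type<number>` as in this dataset.

```python
import pandas as pd

def soil_count_matrix(frame, soil_columns, target="Cover_Type"):
    """Rows are cover types, columns are soil type numbers, values are sample counts."""
    counts = frame.groupby(target)[soil_columns].sum()
    # Replace 'Soil_Type12' style names with the bare number for a compact axis
    counts.columns = [int(name.replace("Soil_Type", "")) for name in soil_columns]
    counts.columns.name = "Soil_Type"
    return counts

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "Soil_Type1": [1, 1, 0, 1],
    "Soil_Type2": [0, 0, 1, 0],
    "Cover_Type": [1, 1, 2, 2],
})

matrix = soil_count_matrix(toy, ["Soil_Type1", "Soil_Type2"])
print(matrix)
```

The result has the same layout as `distt`, so it could be passed to `sns.heatmap` in the same way.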

In [None]:
# Let's drop the columns with 0 samples
drop_cols = [item for item in soil_types if df[item].sum() == 0]
drop_cols

In [None]:
df = df.drop(drop_cols, axis=1)

*I need to make a note of which columns were dropped!*
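
One way to keep that note is to hold the dropped column names in a list and reuse it on any other frame that has to match. The sketch below uses toy frames; the names `empty_indicator_columns`, `train` and `other` are hypothetical, and whether a later frame should be reduced in the same way is a modelling choice.

```python
import pandas as pd

def empty_indicator_columns(frame, columns):
    """Return the indicator columns that are 0 in every row of frame."""
    return [name for name in columns if frame[name].sum() == 0]

# Toy frames standing in for the training data and some later data (hypothetical values)
train = pd.DataFrame({
    "Soil_Type7": [0, 0, 0],
    "Soil_Type10": [1, 0, 1],
    "Cover_Type": [6, 2, 6],
})
other = pd.DataFrame({"Soil_Type7": [0, 1], "Soil_Type10": [0, 1]})

soil_cols = ["Soil_Type7", "Soil_Type10"]
dropped = empty_indicator_columns(train, soil_cols)  # this list is the "note"
train_reduced = train.drop(columns=dropped)
# errors="ignore" skips any listed column that the other frame does not have
other_reduced = other.drop(columns=dropped, errors="ignore")
print(dropped)
```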

Please let me know how you liked this analysis. Suggestions for improvement are most welcome!
Hopefully I will post some modelling results soon too.