## Exploratory Data Analysis - Basic

First, let's import the libraries that we will need for our data analysis.

*NumPy* - Fast arrays, mathematical functions and linear algebra

*Pandas* - Data analytics and easy CSV input / output

*Matplotlib* - Basic plotting functionality

*Seaborn* - "Snazzier" statistical plots, built on top of matplotlib with more polished defaults

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

pd.options.mode.chained_assignment = None  # default='warn'
pd.set_option('display.max_columns', 500)

Read in the train, test and macro data.

In [None]:
train_df = pd.read_csv('../input/train.csv')
test_df = pd.read_csv('../input/test.csv')
macro_df = pd.read_csv('../input/macro.csv')

id_test = test_df.id

The read_csv function in pandas loads each CSV file into a DataFrame, a labelled table whose columns are backed by NumPy arrays.

We can access some of the functionality of this object to begin our data exploration. 
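
As a quick check of what read_csv hands back, the sketch below parses a tiny in-memory CSV and prints the types involved. The CSV text and the names `csv_text` and `demo_df` are made up for illustration; they are not part of the competition data.

```python
import io

import pandas as pd

# A tiny, made-up CSV standing in for a file on disk
csv_text = "id,full_sq,price_doc\n1,43,5850000\n2,34,6000000\n"

demo_df = pd.read_csv(io.StringIO(csv_text))

print(type(demo_df))                        # the table returned by read_csv
print(type(demo_df["full_sq"]))             # a single column of that table
print(type(demo_df["full_sq"].to_numpy()))  # the column's values as a NumPy array
```

The same attributes and methods used in the rest of this notebook (head, tail, shape, dtypes) are available on any such object.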

Let's start by looking at the first and last few records of each data set to get a feel for what they look like.

In [None]:
train_df.head() # this will show us the first five records in the training set

In [None]:
train_df.tail() # this will show us the last five records in the training set

Already we are getting a feel for the data - note that NaN is short for Not a Number and represents missing data.
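
To make the NaN behaviour concrete, here is a minimal sketch on made-up values (the frame `demo_df` is hypothetical, only the column names are borrowed from the training set). It shows how missing entries can be flagged and counted.

```python
import numpy as np
import pandas as pd

# Made-up values; np.nan marks the missing entries
demo_df = pd.DataFrame({
    "life_sq": [27.0, np.nan, 29.0],
    "num_room": [2.0, np.nan, np.nan],
})

print(demo_df.isnull())        # True wherever a value is missing
print(demo_df.isnull().sum())  # number of missing values per column
print(np.nan == np.nan)        # NaN never compares equal to itself, so use isnull()
```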

Let's continue with a look at the test set.

In [None]:
test_df.head()

In [None]:
test_df.tail()

And finally let's have a quick look at the macro data.

In [None]:
macro_df.head()

In [None]:
macro_df.tail()

It is also useful to look at the "shape" of each dataset, i.e. the number of rows and columns.

In [None]:
test_df.shape

In [None]:
train_df.shape

In [None]:
macro_df.shape
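
The shape attribute is a (rows, columns) tuple, so it can be unpacked directly. When two sets differ in their number of columns, a set difference on the column names shows which ones are extra. A small sketch, assuming two made-up frames `demo_train` and `demo_test`:

```python
import pandas as pd

# Hypothetical miniature train and test sets
demo_train = pd.DataFrame({"id": [1, 2, 3], "full_sq": [43, 34, 89], "price_doc": [5.8, 6.0, 5.7]})
demo_test = pd.DataFrame({"id": [4, 5], "full_sq": [39, 79]})

n_rows, n_cols = demo_train.shape  # unpack the (rows, columns) tuple
print(n_rows, n_cols)

# Columns present in the training frame but not in the test frame
extra_columns = set(demo_train.columns) - set(demo_test.columns)
print(extra_columns)
```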

Next we take a look at the datatypes contained in the training set. 

In [None]:
data_types = train_df.dtypes
data_types

In [None]:
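## Count of different datatypes
plt.figure(figsize=(10,8))
# Convert the dtype objects to strings and pass them as x so seaborn treats them as categories
sns.countplot(x=data_types.astype(str))
plt.show()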
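
The same counts can also be read as numbers rather than from a plot. A minimal sketch on a made-up frame (`demo_df` and its values are hypothetical):

```python
import pandas as pd

# Made-up frame with one integer, one float and one text column
demo_df = pd.DataFrame({
    "id": [1, 2],
    "full_sq": [43.0, 34.5],
    "sub_area": ["A", "B"],
})

# Number of columns per datatype
dtype_counts = demo_df.dtypes.astype(str).value_counts()
print(dtype_counts)
```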

We will count the missing values later, as part of data cleaning.

For now let's focus on the training set, and in particular the variable we are asked to predict: the house price, stored in the price_doc column.

We start with a scatter plot of the sorted prices to check for outliers.

In [None]:
plt.figure(figsize=(8,6))
plt.scatter(range(train_df.shape[0]), np.sort(train_df.price_doc.values))
plt.xlabel('index', fontsize=12)
plt.ylabel('price', fontsize=12)
plt.show()

That looks OK, so let's do a distribution plot.

In [None]:
## Visualizing the target data
plt.figure(figsize=(10,6))
# histplot replaces distplot(..., kde=False), which has been deprecated since seaborn 0.11
sns.histplot(train_df['price_doc'], bins=50)
plt.xlabel('price')
plt.show()

In [None]:
# We can see the data is positively skewed and the range is large. A logarithmic plot visualizes the data better.
## Let's plot the log of the target variable
plt.figure(figsize=(10,6))
# histplot replaces distplot(..., kde=False), which has been deprecated since seaborn 0.11
sns.histplot(np.log(train_df['price_doc']), bins=50)
plt.xlabel('log(price)')
plt.show()
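
The effect of the log transform on skew can also be checked numerically. The sketch below uses synthetic log-normally distributed values as a stand-in for prices (an assumption for illustration, not the competition data) and compares the sample skewness before and after taking the log.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic, log-normally distributed "prices" (not the competition data)
demo_price = pd.Series(rng.lognormal(mean=15.0, sigma=0.6, size=10_000))

raw_skew = demo_price.skew()          # clearly positive for a long right tail
log_skew = np.log(demo_price).skew()  # close to zero once the tail is compressed

print(f"skew of raw values: {raw_skew:.2f}")
print(f"skew of log values: {log_skew:.2f}")
```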

In [None]:
## Let's visualize the internal characteristics of the house and their relation to price
plt.figure(figsize=(10,8))
internal_characteristics=['full_sq', 'life_sq', 'floor', 'max_floor', 'material',
                          'num_room', 'kitch_sq','price_doc']
heatmap_data=train_df[internal_characteristics].corr()
sns.heatmap(heatmap_data,annot=True)
plt.show()

In [None]:
## We can see a high correlation between full_sq and num_room
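
Reading exact values off a heatmap can be fiddly, so it sometimes helps to rank the correlations for one column directly. A sketch on synthetic data, where `num_room` is deliberately constructed from `full_sq` and `floor` is independent (all values are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data: num_room is derived from full_sq plus noise, floor is unrelated
full_sq = rng.uniform(30, 120, size=500)
demo_df = pd.DataFrame({
    "full_sq": full_sq,
    "num_room": np.round(full_sq / 30 + rng.normal(0, 0.3, size=500)),
    "floor": rng.integers(1, 20, size=500),
})

corr = demo_df.corr()  # pairwise Pearson correlation matrix

# Rank the other columns by their correlation with full_sq
ranked = corr["full_sq"].drop("full_sq").sort_values(ascending=False)
print(ranked)
```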

In [None]:
## We will now count the number of houses built each year
grouped_data_count=train_df.groupby('build_year')['id'].aggregate('count').reset_index()
grouped_data_count.columns=['build_year','count']
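
A small sketch of what this kind of group-and-count produces, on made-up rows (`demo_df` is hypothetical). Note that groupby leaves out rows whose key is missing by default, so houses without a build year do not appear in the counts.

```python
import numpy as np
import pandas as pd

# Made-up rows; the fourth house has no build year
demo_df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "build_year": [1990.0, 1990.0, 2005.0, np.nan, 2005.0],
})

# Count ids per build year and turn the result back into a two-column frame
counts = demo_df.groupby("build_year")["id"].count().reset_index(name="count")
print(counts)
```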

In [None]:
## Let's check the minimum and maximum build years
# idxmax/idxmin return index labels, so look the rows up with .loc
print(grouped_data_count.loc[grouped_data_count['build_year'].idxmax()])
print(grouped_data_count.loc[grouped_data_count['build_year'].idxmin()])

grouped_data_count[grouped_data_count['build_year']>2018]

In [None]:

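# Drop rows with a build year after 2018 but keep rows where it is missing
# (NaN < 2019 evaluates to False, so a bare comparison would drop those rows too)
train_data = train_df[train_df['build_year'].isnull() | (train_df['build_year'] < 2019)]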
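
One thing to watch with a filter like this: any comparison against NaN is False, so a plain `< 2019` test removes rows with a missing build year along with the invalid ones. A minimal sketch on made-up values (the frame and its entries are hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up build years: one missing value and one obviously invalid entry
demo_df = pd.DataFrame({"build_year": [1990.0, np.nan, 20052009.0, 2005.0]})

# A bare comparison drops the invalid year AND the missing one
strict = demo_df[demo_df["build_year"] < 2019]

# Adding an explicit isnull() test keeps the missing row
keep_missing = demo_df[demo_df["build_year"].isnull() | (demo_df["build_year"] < 2019)]

print(len(strict))        # 2
print(len(keep_missing))  # 3
```

Which variant is appropriate depends on whether the missing years are meant to be handled later during cleaning or discarded here.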

In [None]:
## Build years after 2018 are clearly incorrect and need to be
# rectified during our data cleaning process

# Let's visualize the counts for build years after 1950 and before 2018
grouped_data_count = grouped_data_count[(grouped_data_count['build_year'] > 1950) & (grouped_data_count['build_year'] < 2018)]
plt.figure(figsize=(10,8))
# Pass x and y by keyword; recent seaborn versions no longer accept them positionally
sns.barplot(x=grouped_data_count['build_year'], y=grouped_data_count['count'], color='r')
plt.xticks(rotation='vertical')
plt.show()

In [None]:
## Let's start working on data cleaning and removing bad data from the dataset
## We will start by visualizing the missing data in all the columns

# Percentage of missing values per column
train_missing = train_data.isnull().sum() / len(train_data) * 100
# Keep only the columns that have missing values, largest share first
train_missing = train_missing[train_missing > 0].sort_values(ascending=False).reset_index()
train_missing.columns = ['column name', 'missing percentage']
plt.figure(figsize=(12,8))
# Pass x and y by keyword; recent seaborn versions no longer accept them positionally
sns.barplot(x=train_missing['column name'], y=train_missing['missing percentage'], palette='coolwarm')
plt.xticks(rotation='vertical')
plt.show()
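
The numbers behind a plot like this can be produced in a couple of lines. The sketch below is a minimal version on a made-up frame (`demo_df` and its values are hypothetical); it relies on the fact that the mean of a boolean mask is the fraction of True values.

```python
import numpy as np
import pandas as pd

# Made-up frame: full_sq is complete, the other two columns have gaps
demo_df = pd.DataFrame({
    "full_sq": [43.0, 34.0, 89.0, 77.0],
    "life_sq": [27.0, np.nan, 50.0, np.nan],
    "kitch_sq": [np.nan, np.nan, np.nan, 8.0],
})

# Share of missing values per column, expressed as a percentage
missing_pct = demo_df.isnull().mean() * 100

# Keep only columns with gaps, worst first
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```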