<h1>Visual Inspection of Data</h1>
In this sidebar, we want to visually examine data.  In the lecture we discussed the idea of correlation between a target and input variables, <b>but</b> how could we examine this in Python?<br/><br/>
The most common graphical method to visualise correlation is a scatter plot.  Here, we plot two variables against each other (the graph shows one point for each row in the data).  If the two variables are strongly linearly correlated, the points will cluster around a straight line.<br/><br/>
Let's examine how this can be done with the veteran dataset (how we create a plot), then let's examine how we can automate the creation of the charts.  This will speed up our analysis.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# read the veteran dataset
df = pd.read_csv('datasets/veteran.csv')

If we look at our dataframe, we'll see that there are lots of numeric fields which could be charted against a target variable.<br/>
<b>Assuming</b> that the value of the contribution is relative to some other field in the data, how might we observe that visually?

In [None]:
df.info()

Let's suppose we want to examine how one attribute (GiftCnt36) relates to another (let's use TargetD), so we create a scatter plot.<br/><br/>

In [None]:
plt.scatter(df['TargetD'], df['GiftCnt36'])
plt.xlabel('TargetD')
plt.ylabel('GiftCnt36')
plt.title('Scatter TargetD v GiftCnt36')
plt.show()

What does highly correlated data look like?  Simple: let's look at how the target relates to itself.

In [None]:
plt.scatter(df['TargetD'], df['TargetD'])
plt.xlabel('TargetD')
plt.ylabel('TargetD')
plt.title('Scatter TargetD v TargetD')
plt.show()
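A scatter plot shows correlation visually, and a number can back up what the eye sees.  The sketch below is an illustration only: it uses small synthetic data rather than the veteran dataset, and the names `x` and `x_noisy` are made up.  It relies on pandas' `Series.corr`, which computes Pearson's correlation coefficient by default.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only (not taken from veteran.csv)
rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=200))
x_noisy = x + rng.normal(scale=1.0, size=200)  # x plus random noise

# Series.corr uses Pearson's r by default: 1.0 means a perfect straight line
r_self = x.corr(x)
r_noisy = x.corr(x_noisy)

print('x v x      :', round(r_self, 3))
print('x v x_noisy:', round(r_noisy, 3))
```

A variable plotted against itself gives a coefficient of 1, matching the perfect diagonal in the chart; adding noise pulls the coefficient towards 0 and spreads the points away from the line.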

<b>We can bundle the creation of a chart into a function so that we only need to provide it with column names</b>

In [None]:
def makeplot(x, y):
    plt.scatter(df[x], df[y])
    plt.xlabel(x)
    plt.ylabel(y)
    plt.title('Scatter ' + x + ' v ' + y)
    plt.show()

In [None]:
makeplot('TargetD', 'GiftCnt36')
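Note that `makeplot` reads the global `df`.  One possible variation, sketched here under the assumption that you may want to chart more than one dataframe, passes the data in as an argument and returns the matplotlib Axes.  The name `makeplot_from` and the `demo` frame are hypothetical and exist only for this example.

```python
import pandas as pd
import matplotlib.pyplot as plt

def makeplot_from(data, x, y):
    """Scatter data[x] against data[y] and return the Axes for further tweaks."""
    fig, ax = plt.subplots(figsize=(5, 5))
    ax.scatter(data[x], data[y])
    ax.set_xlabel(x)
    ax.set_ylabel(y)
    ax.set_title('Scatter ' + x + ' v ' + y)
    return ax

# Tiny made-up frame, for illustration only
demo = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 5, 8]})
ax = makeplot_from(demo, 'a', 'b')
```

In a notebook with inline plotting the figure is drawn when the cell finishes; elsewhere, call `plt.show()` afterwards.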

Now all we need is a list of columns to pass to our function (since we know we are going to target TargetD).  Let's get a list of all numeric columns except for those which we know shouldn't be plotted against our target (for example, because the attribute is meaningless).

In [None]:
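# columns we don't want to chart (identifiers and the targets themselves)
cols_remove = ['TargetB', 'ID', 'TargetD']
# select numeric columns through the public API, skipping the ones listed above
cols_numeric = [c for c in df.select_dtypes(include='number').columns
                if c not in cols_remove]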

Which numeric columns are left?

In [None]:
cols_numeric

Now we can iterate over every pair of columns in the list and see if there are any blindingly obvious correlations.

In [None]:
%matplotlib inline
plt.rcParams['figure.figsize']=(5,5)
for col1 in cols_numeric:
    for col2 in cols_numeric:
        if col1 != col2:
            print(col1, 'v', col2)
            makeplot(col1, col2)
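With this many charts, it can help to rank the columns numerically first and then look at the most promising plots.  The following is a minimal sketch on synthetic data (the column names `strong`, `weak`, `unrelated` and `target` are invented for the example); it uses `DataFrame.corrwith` to correlate each candidate column with a target.  Pearson's coefficient only measures linear relationships, so a low value does not rule out a curved pattern that a scatter plot would reveal.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataframe (illustration only)
rng = np.random.default_rng(1)
n = 300
demo = pd.DataFrame({
    'strong': rng.normal(size=n),
    'weak': rng.normal(size=n),
    'unrelated': rng.normal(size=n),
})
demo['target'] = 3 * demo['strong'] + 0.5 * demo['weak'] + rng.normal(size=n)

# Correlation of every candidate column with the target
candidates = ['strong', 'weak', 'unrelated']
corr = demo[candidates].corrwith(demo['target'])

# Order by strength of relationship, ignoring the sign
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```

On the veteran data, the same idea would apply with `cols_numeric` as the candidates and `TargetD` as the target, assuming those columns are present as described above.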

<h3>Trellis chart (seaborn Pairplot)</h3> Takes the attributes (variables) as x and y axis values and plots a chart for each intersection (you'll see what I mean soon).<br/>
"Trellis chart" is a generic term in visualisation, so let's look at how we can examine multiple variables in one image.

In [None]:
import seaborn as sns

In [None]:
temp = df.fillna(0)

In [None]:
temp.head()

If we pass a dataframe, the pair plot will plot all numeric columns.

In [None]:
sns.pairplot(temp)
plt.show()

In [None]:
# 'size' was renamed to 'height' in seaborn 0.9
sns.pairplot(temp[['GiftCnt36', 'GiftCntAll', 'GiftCntCard36', 'GiftCntCardAll', 'GiftAvgLast']], height=3)
plt.show()
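When only the relationship with one target matters, the full grid can be trimmed to a single row.  This sketch assumes seaborn 0.9 or later and uses a made-up frame (`demo`, with columns `a`, `b`, `c` and `target`); `x_vars` and `y_vars` are `pairplot` parameters that choose which columns go on each axis.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up data for illustration only
rng = np.random.default_rng(2)
demo = pd.DataFrame(rng.normal(size=(100, 4)),
                    columns=['a', 'b', 'c', 'target'])

# One row of panels: the target on the y axis, each input on the x axis
grid = sns.pairplot(demo, x_vars=['a', 'b', 'c'], y_vars=['target'], height=3)
```

The returned `PairGrid` holds one Axes per panel in `grid.axes`, which is useful if individual charts need titles or axis limits adjusted.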