# The Nature of Data and Statistical Modeling

These examples will explore some introductory data and statistics concepts using a [Jupyter](http://jupyter.org/) notebook, the [Python](https://www.python.org/) programming language, the [pandas](http://pandas.pydata.org/) Python library, and a fictional [dataset](https://www.mindsumo.com/contests/building-better?permit=3e8d081171c02dd6) for a Wells Fargo analytics challenge.

Code can be entered/edited in cells beginning with `In`.  To execute code, press `SHIFT + ENTER`.

## Part 1: Data Example

To begin, we'll load the *pandas* library that will be useful in loading, exploring, and manipulating the data.  Once we've loaded the library, we'll read data from a file located at `http://biws.cscc.arthurneuman.com/wf_fake_balances.csv` and store it in a variable named `balances`. Included in this data set are end-of-month account balances along with other, related data. The type of object used to store our data is known as a *DataFrame*; we'll often use DataFrames to store data when working with pandas. The columns of a DataFrame are *Series* objects. We can easily see a portion of the data when we work with DataFrames.

In [None]:
# this is a comment in python
# load pandas
import pandas 

In [None]:
# load data from file
balances = pandas.read_csv('http://biws.cscc.arthurneuman.com/wf_fake_balances.csv')

In [None]:
# display some data
balances
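
Displaying a whole DataFrame can be unwieldy, so it often helps to check its size and look at only the first few rows. The sketch below uses a small, made-up DataFrame named `sample` (an assumption for illustration, not the real data) that borrows three column names from this dataset. It also checks that a single column is a Series.

```python
import pandas

# hypothetical stand-in for the balances DataFrame; all values are made up
sample = pandas.DataFrame({
    "age": [34, 51, 27, 45, 62, 38],
    "cc_flag": [1, 0, 0, 1, 1, 0],
    "online_bank_cnt": [12, 0, 30, 4, 1, 9],
})

# number of rows and columns, as a (rows, columns) tuple
print(sample.shape)

# only the first three rows
print(sample.head(3))

# a single column of a DataFrame is a Series
age_column = sample["age"]
print(isinstance(age_column, pandas.Series))
```

The same `shape` property and `head()` method can be used on `balances`.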

We can see that there are many columns.  To see a full list of the columns, we will make use of the `columns` property; we can use columns' `tolist()` method to make the output easier to read. 

In [None]:
# list of the dataset's column labels
balances.columns.tolist()

Often when we work with data, we will have access to a data dictionary or some other collection of metadata including a description of the columns.  For this dataset, the column descriptions are stored in `/usr/local/share/bi/wf_fake_metadata.csv`.  Though this file is simply a list and description of columns, we can use pandas to display its contents. 

In [None]:
# load metadata from file and display the file contents
metadata = pandas.read_csv('/usr/local/share/bi/wf_fake_metadata.csv')
metadata

For this lab, we'll only be working with month-end balances, so we're only interested in rows 0 through 25.  Note that `NaN` stands for "Not a number" and is used to represent missing data - blank lines or cells in this case.
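
To see how pandas treats missing values, here is a minimal sketch using a made-up table. The names `metadata_sample`, `column`, and `description` are assumptions for illustration and may not match the real metadata file.

```python
import pandas

# hypothetical metadata with one completely blank row; values are made up
metadata_sample = pandas.DataFrame({
    "column": ["age", float("nan"), "cc_flag"],
    "description": ["Customer age", float("nan"), "Credit card flag"],
})

# isna() marks missing cells; sum() counts them per column
missing_per_column = metadata_sample.isna().sum()
print(missing_per_column)

# drop rows in which every cell is missing
cleaned = metadata_sample.dropna(how="all")
print(cleaned)
```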

The data in this dataset are examples of structured data.  This dataset contains both categorical and numeric data.

### Categorical and Numeric Data

Categorical data can be either nominal or ordinal.  Typically, numeric values are assigned to categorical data to make processing easier.  Recall that the difference between nominal and ordinal data is that we can order ordinal data - we can rank values saying one value is "higher", "greater", or "better" than another value.
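
One way to represent this distinction in pandas is with `Categorical` objects, which can be ordered or unordered. The sketch below uses made-up values (the rating and account labels are assumptions, not columns in this dataset) and shows the integer codes pandas assigns to each category.

```python
import pandas

# hypothetical ordinal data: the categories have a meaningful order
ratings = pandas.Series(["low", "high", "medium", "low"])
ordinal = pandas.Categorical(
    ratings, categories=["low", "medium", "high"], ordered=True
)

# integer codes assigned to each value, following the category order
print(ordinal.codes.tolist())  # [0, 2, 1, 0]

# ordered categories can be compared, so a maximum exists
print(ordinal.max())  # high

# hypothetical nominal data: no order is defined
nominal = pandas.Categorical(["checking", "savings", "checking"])
print(nominal.ordered)  # False
```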

Looking at the data above, the *cc_flag* column appears to contain nominal data and to have only two values: `0` and `1`.  We can confirm this using the `cc_flag` property of the `balances` object to access only the *cc_flag* column, then using the column's `unique()` method. Again, we'll use `tolist()` to make the output easier to read.

In [None]:
# unique values of the cc_flag column
balances.cc_flag.unique().tolist()
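
Beyond listing the unique values, it can be useful to see how often each one occurs. A minimal sketch with a made-up Series (the values are assumptions, not the real *cc_flag* data) uses the `value_counts()` method:

```python
import pandas

# hypothetical flag values; made up for illustration
flags = pandas.Series([0, 1, 1, 0, 1, 1], name="cc_flag")

# distinct values, in order of first appearance
print(flags.unique().tolist())  # [0, 1]

# number of occurrences of each value
counts = flags.value_counts()
print(counts.to_dict())
```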

The dataset also includes numerical data.  The balances of various customer accounts are examples of ratio variables. Recall that with interval data, all the properties of ordinal data apply and the difference between values is meaningful. Ratio data has the additional property that zero is non-arbitrary and that ratios between values have meaning.  

In the space below, display all the values of a column using the `tolist()` method.

In [None]:
# display the values of a column in balances


## Part 2: Descriptive Statistics

With pandas, we can easily compute descriptive statistics.  For example, we can see the mean, median, and mode for the number of monthly online bank transactions per customer.  We can use the `print()` function if we want to display multiple things on a line and within a block of statements. Note that there could be multiple modes; we'll use the `values` property to display all the modes.

In [None]:
# mean
print("Mean: ", balances.online_bank_cnt.mean())

# median
print("Median: ", balances.online_bank_cnt.median())

# mode
print("Modes: ", balances.online_bank_cnt.mode().values)
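
To illustrate why `mode()` returns a collection rather than a single number, the sketch below uses a small made-up Series in which two values are tied for most frequent.

```python
import pandas

# hypothetical transaction counts; 2 and 5 each appear twice
transaction_counts = pandas.Series([2, 2, 3, 5, 5, 9])

mean_value = transaction_counts.mean()
median_value = transaction_counts.median()
modes = transaction_counts.mode()

print("Mean: ", mean_value)
print("Median: ", median_value)  # 4.0
print("Modes: ", modes.values)   # both 2 and 5 are reported
```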

DataFrames and Series (the type of object we use when working with columns directly) also have a `describe()` method that can be used to display descriptive statistics related to central tendency and dispersion. Note that columns without numeric data will be omitted from the results.

In [None]:
# Descriptive statistics for each column in the DataFrame
balances.describe()

In [None]:
# Descriptive statistics for a specific column
balances.online_bank_cnt.describe()

The pandas library also provides plotting capability through the use of another Python library, [matplotlib](https://matplotlib.org/).  We can generate a box plot using a DataFrame's `boxplot()` method and specifying the column name we're interested in or using the Series' `plot()` method and specifying the `kind` of plot to generate.

In [None]:
# configure the notebook to embed plots
%matplotlib inline

# box plot
balances.online_bank_cnt.plot(kind='box')

Here, we can see that there are quite a few outliers.  In fact, if we compare the mean and the median, represented by the 50th-percentile value above, we can get a sense of how much outliers can affect the mean.  
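
The effect of an outlier on the mean can be seen with a small made-up example. The sketch below also flags outliers using the common rule of 1.5 times the interquartile range beyond the quartiles, which is the default convention matplotlib uses for box plot whiskers. The data and variable names are assumptions for illustration.

```python
import pandas

# hypothetical values with one extreme observation
values = pandas.Series([1, 2, 2, 3, 3, 4, 100])

# the single large value pulls the mean far above the median
print("Mean: ", values.mean())
print("Median: ", values.median())  # 3.0

# quartiles and interquartile range
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# points beyond 1.5 * IQR from the quartiles are treated as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())  # [100]
```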

If we're not interested in outliers, we can generate a plot without them by specifying `showfliers=False` in the `plot()` method.  We could also change the scale of the y-axis to be logarithmic using `logy=True`.

In [None]:
# box plot of online transaction counts without outliers
balances.online_bank_cnt.plot(kind='box', showfliers=False)

In [None]:
# box plot with logarithmic scaling of y-axis
balances.online_bank_cnt.plot(kind="box", logy=True)

In the space below, generate the box plot of one of the columns representing account balance.

We can plot the histogram using the `plot()` method and specifying `kind='hist'`.

In [None]:
balances.online_bank_cnt.plot(kind='hist')

We can also calculate the skewness and kurtosis.

In [None]:
# skewness
print("Skewness: ", balances.online_bank_cnt.skew())

# kurtosis
print("Kurtosis: ", balances.online_bank_cnt.kurtosis())

Recall that skew indicates whether more values are on the left or right side of the distribution.  Here, a positive value for skewness is consistent with the histogram, which shows most values concentrated on the left side with a long tail extending to the right.  Kurtosis measures how tall/skinny the histogram is compared to the [standard normal distribution](http://mathworld.wolfram.com/StandardNormalDistribution.html).  The pandas `kurtosis()` method reports excess kurtosis, so a normal distribution has a value of 0.
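
As a rough illustration, the sketch below compares a right-skewed sample with a symmetric one. Both samples are randomly generated (an assumption for illustration, unrelated to the dataset), so the exact values depend on the seed. Note that pandas computes kurtosis using Fisher's definition, under which a normal distribution has a kurtosis of 0.

```python
import numpy
import pandas

# seeded random generator so the example is repeatable
rng = numpy.random.default_rng(0)

# hypothetical right-skewed sample: most values small, long tail to the right
right_skewed = pandas.Series(rng.exponential(scale=1.0, size=10000))

# hypothetical symmetric sample drawn from a normal distribution
symmetric = pandas.Series(rng.normal(size=10000))

# skewness is clearly positive for the right-skewed sample
print("Skewness (right-skewed): ", right_skewed.skew())
print("Kurtosis (right-skewed): ", right_skewed.kurtosis())

# both statistics are close to zero for the normal sample
print("Skewness (symmetric): ", symmetric.skew())
print("Kurtosis (symmetric): ", symmetric.kurtosis())
```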

In the space below, plot the histogram and calculate skewness and kurtosis for the values in the *age* column.

## Part 3: Regression

We can use the data in our DataFrame to calculate linear regressions using ordinary least squares.  [Scikit-Learn](http://scikit-learn.org/stable/) is one library that we can use to do the necessary calculations.  

As an example, we'll explore the relationship between age and branch visits.  To start, we'll assign the explanatory data (age) to a variable named `X` and the response data (branch visits) to a variable named `y`.  Note that we need to change how the explanatory data is stored using the `reshape()` method: scikit-learn expects the explanatory data as a two-dimensional array with one row per observation and one column per variable.

In [None]:
# store explanatory and response data in new variables
X = balances.age.values.reshape(-1, 1)
y = balances.branch_visit_cnt
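
To see what `reshape(-1, 1)` does, the sketch below compares array shapes before and after reshaping, using a made-up Series of ages (the values are assumptions for illustration).

```python
import pandas

# hypothetical ages; made up for illustration
ages = pandas.Series([25, 40, 58], name="age")

# the underlying values form a one-dimensional array
one_d = ages.values
print(one_d.shape)  # (3,)

# -1 lets numpy infer the number of rows; 1 requests a single column
two_d = ages.values.reshape(-1, 1)
print(two_d.shape)  # (3, 1)
```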

Now we can make use of the scikit-learn library to calculate the regression coefficient and intercept.  Because we only have one explanatory variable, we will work with a simple, or one-dimensional, linear regression.

In [None]:
# import LinearRegression from the sklearn library
from sklearn.linear_model import LinearRegression

# create a LinearRegression object
regression = LinearRegression()

# calculate the coefficient and intercept with existing data
regression.fit(X, y)

We can display the calculated coefficient and intercept.

In [None]:
print("Coefficient: ", regression.coef_)
print("Intercept: ", regression.intercept_)

If we were to write an equation using these values (with rounding), we would have:

y = -0.18*x + 0.12
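
An equation of this form is what the `predict()` method evaluates. As a sketch, the example below fits a regression to made-up data that lies exactly on the line y = 2x + 1 (an assumption for illustration, not the dataset's values) and checks that a manual calculation with the coefficient and intercept agrees with `predict()`.

```python
import numpy
from sklearn.linear_model import LinearRegression

# hypothetical data lying exactly on the line y = 2x + 1
X_demo = numpy.array([1.0, 2.0, 3.0, 4.0]).reshape(-1, 1)
y_demo = numpy.array([3.0, 5.0, 7.0, 9.0])

model = LinearRegression().fit(X_demo, y_demo)

# coef_ holds one coefficient per explanatory variable
slope = model.coef_[0]
intercept = model.intercept_

# evaluate the equation by hand for x = 5
manual = slope * 5.0 + intercept

# the same value computed by scikit-learn
predicted = model.predict(numpy.array([[5.0]]))[0]

print(manual, predicted)
```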

We can also plot the data and the regression line.  When we generated plots previously, we made use of matplotlib through pandas; for our next plot, we'll make use of matplotlib directly. When we plot the regression line, we should use a new set of values for the explanatory variable; the values should increase from the smallest value in the original data to the largest. To do this, we'll make use of the [numpy](http://www.numpy.org/) library, which we'll import directly because the `pandas.np` alias is no longer available in recent versions of pandas.

In [None]:
# import part of matplotlib and the numpy library
import matplotlib.pyplot as plt
import numpy

# create a scatter plot with our data 
plt.scatter(X, y,  color='black')

# for plotting, use new range of values for explanatory variable
X_plot = numpy.linspace(X.min(), X.max()).reshape(-1, 1)

# plot a line using the range of values as x-values
# and values calculated with the regression for y-values
plt.plot(X_plot, regression.predict(X_plot), color='blue', linewidth=1)

# show the plot
plt.show()
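
The `linspace()` function generates evenly spaced values between two endpoints; when no count is given, it produces 50 values. A minimal sketch, with endpoints chosen arbitrarily for illustration:

```python
import numpy

# evenly spaced values from 20 to 80, using the default count of 50
points = numpy.linspace(20, 80)
print(len(points))            # 50
print(points[0], points[-1])  # 20.0 80.0

# reshape into a single column, as scikit-learn expects
column = points.reshape(-1, 1)
print(column.shape)           # (50, 1)
```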

We can also calculate the coefficient of determination to see how well the regression fits the data. If the regression perfectly fit the data, we'd expect to see a value of 1.

In [None]:
# coefficient of determination:
regression.score(X, y)
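
For a linear regression, `score()` returns the coefficient of determination, which can also be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks this on made-up data (the values are assumptions for illustration).

```python
import numpy
from sklearn.linear_model import LinearRegression

# hypothetical data that roughly follows a straight line
X_demo = numpy.array([1.0, 2.0, 3.0, 4.0, 5.0]).reshape(-1, 1)
y_demo = numpy.array([2.1, 3.9, 6.2, 7.8, 10.1])

model = LinearRegression().fit(X_demo, y_demo)
predictions = model.predict(X_demo)

# residual sum of squares: variation the regression does not explain
ss_res = numpy.sum((y_demo - predictions) ** 2)

# total sum of squares: variation around the mean of the response
ss_tot = numpy.sum((y_demo - y_demo.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
r2_sklearn = model.score(X_demo, y_demo)

print(r2_manual, r2_sklearn)
```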

In the space below, calculate and plot a linear regression using another column as the response variable.  You can repeat all the steps above but be sure to use a different column when assigning data to the `y` variable.