# The Nature of Data and Statistical Modeling

These examples explore some introductory data and statistics concepts using a [Jupyter](http://jupyter.org/) notebook, the [Python](https://www.python.org/) programming language, the [pandas](http://pandas.pydata.org/) Python library, and a fictional [dataset](https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset) from [Kaggle](https://www.kaggle.com/).

Code can be entered/edited in cells beginning with `In`.  To execute code, press `SHIFT + ENTER`.

## Part 1: Data Example

To begin, we'll load the *pandas* library, which we'll use to load, explore, and manipulate the data.  Once we've loaded the library, we'll read data from a file located at `/usr/local/share/bi/attrition.csv` and store it in a variable named `data`. The type of object used to store our data is known as a *DataFrame*; we'll often use DataFrames to store data when working with pandas. Evaluating a DataFrame in a cell displays a portion of its rows and columns.

In [2]:
# this is a comment in python
# load pandas
import pandas 

In [3]:
# load data from file
data = pandas.read_csv('/usr/local/share/bi/attrition.csv')

In [10]:
# display some data
data

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2
5,32,No,Travel_Frequently,1005,Research & Development,2,2,Life Sciences,1,8,...,3,80,0,8,2,2,7,7,3,6
6,59,No,Travel_Rarely,1324,Research & Development,3,3,Medical,1,10,...,1,80,3,12,3,2,1,0,0,0
7,30,No,Travel_Rarely,1358,Research & Development,24,1,Life Sciences,1,11,...,2,80,1,1,2,3,1,0,0,0
8,38,No,Travel_Frequently,216,Research & Development,23,3,Life Sciences,1,12,...,2,80,0,10,2,3,9,7,1,8
9,36,No,Travel_Rarely,1299,Research & Development,27,3,Medical,1,13,...,2,80,2,17,3,2,7,7,7,7
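For readers who do not have the file at the path above, the loading step can be tried on a few rows typed in by hand. This is only a minimal sketch: the name `sample`, the in-memory CSV text, and the choice of three columns are assumptions made for illustration, with values copied from the first rows of the output above.

```python
import io
import pandas

# A few rows copied from the output above, limited to three columns
csv_text = (
    "Age,Attrition,Department\n"
    "41,Yes,Sales\n"
    "49,No,Research & Development\n"
    "37,Yes,Research & Development\n"
)

# read_csv accepts a file path or a file-like object; StringIO stands in for the file
sample = pandas.read_csv(io.StringIO(csv_text))

print(type(sample).__name__)  # the object is a DataFrame
print(sample.shape)           # (number of rows, number of columns)
print(sample.head(2))         # only the first two rows
```

`head()` is a convenient way to look at the first rows explicitly rather than relying on the notebook's truncated display.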


We can see that there are many columns, some of which are hidden behind the `...` in the display.  To see a full list of the columns, we use the DataFrame's `columns` property; calling its `tolist()` method makes the output easier to read.  A description of some of these columns and their values can be found on the [dataset's Kaggle page](https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset).



In [15]:
# list of the dataset's column labels
data.columns.tolist()

['Age',
 'Attrition',
 'BusinessTravel',
 'DailyRate',
 'Department',
 'DistanceFromHome',
 'Education',
 'EducationField',
 'EmployeeCount',
 'EmployeeNumber',
 'EnvironmentSatisfaction',
 'Gender',
 'HourlyRate',
 'JobInvolvement',
 'JobLevel',
 'JobRole',
 'JobSatisfaction',
 'MaritalStatus',
 'MonthlyIncome',
 'MonthlyRate',
 'NumCompaniesWorked',
 'Over18',
 'OverTime',
 'PercentSalaryHike',
 'PerformanceRating',
 'RelationshipSatisfaction',
 'StandardHours',
 'StockOptionLevel',
 'TotalWorkingYears',
 'TrainingTimesLastYear',
 'WorkLifeBalance',
 'YearsAtCompany',
 'YearsInCurrentRole',
 'YearsSinceLastPromotion',
 'YearsWithCurrManager']

The data in this dataset are examples of structured data.  This dataset contains nominal, ordinal, interval, and ratio data.

### Categorical Data

Categorical data can be either nominal or ordinal.  Typically, numeric values are assigned to categorical data to make processing easier.  Recall that the difference between nominal and ordinal data is that ordinal data can be ordered: we can rank values, saying that one value is "higher", "greater", or "better" than another value.
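The nominal/ordinal distinction can be sketched with pandas' `Categorical` type. The department names and the `low`/`medium`/`high` ratings below are made-up values for illustration, not columns read from the dataset file.

```python
import pandas

# Nominal: the categories have no inherent order
nominal = pandas.Categorical(["Sales", "Human Resources", "Sales"])

# Ordinal: categories are listed from lowest to highest and flagged as ordered
levels = ["low", "medium", "high"]
ordinal = pandas.Categorical(["medium", "low", "high"], categories=levels, ordered=True)

print(nominal.ordered)               # False
print(ordinal.ordered)               # True
print(ordinal.min(), ordinal.max())  # lowest and highest rating present
print(ordinal.codes.tolist())        # integer position of each value in `levels`
```

The integer codes are one example of numeric values assigned to categories; for the ordinal case they preserve the ranking, while for nominal data any such numbers are just labels.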

Looking at the data above, the *Attrition* column appears to have only two values: `Yes` and `No`.  We can confirm this using the `Attrition` property of the `data` object to access only the *Attrition* column, then using the column's `unique()` method. Again, we'll use `tolist()` to make the output easier to read.

In [19]:
# unique values of the attrition column
data.Attrition.unique().tolist()

['Yes', 'No']
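A related check, sketched here on a small stand-in column rather than the real file, is to count how often each value occurs. The frame `sample` and its five values are invented for illustration.

```python
import pandas

# Small stand-in for the Attrition column (values are made up)
sample = pandas.DataFrame({"Attrition": ["Yes", "No", "No", "Yes", "No"]})

# Bracket access selects the same column as attribute access (sample.Attrition)
column = sample["Attrition"]

distinct = column.unique().tolist()        # distinct values, in order of first appearance
counts = column.value_counts().to_dict()   # number of rows holding each value

print(distinct)
print(counts)
```

Bracket access also works for column names that are not valid Python identifiers, which attribute access cannot handle.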

We can replace these text values with numeric values.  One way to do this is with the column's `replace()` method. To do this, we use a *dictionary*, or collection of key-value pairs, to indicate how the replacement should be made.  Our dictionary, `replacement`, uses each current value as a key and the desired new value as the associated value. We'll replace `No` with `0` and `Yes` with `1`.

In [35]:
# create a dictionary for replacement
# dictionaries are surrounded by { and }, use commas to separate each key-value pair, 
#   and use colons to separate keys and values
# Example: { "key1": "value1", "key2": "value2" }
replacement = {"No": 0, "Yes": 1}

# replace the Attrition values and assign the result back to the column
# (calling replace(..., inplace=True) on data.Attrition may not update data in newer pandas versions)
data['Attrition'] = data['Attrition'].replace(replacement)

In [36]:
# display data. Note the change in the Attrition column's values.
data

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,1,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,0,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,1,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,0,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,0,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2
5,32,0,Travel_Frequently,1005,Research & Development,2,2,Life Sciences,1,8,...,3,80,0,8,2,2,7,7,3,6
6,59,0,Travel_Rarely,1324,Research & Development,3,3,Medical,1,10,...,1,80,3,12,3,2,1,0,0,0
7,30,0,Travel_Rarely,1358,Research & Development,24,1,Life Sciences,1,11,...,2,80,1,1,2,3,1,0,0,0
8,38,0,Travel_Frequently,216,Research & Development,23,3,Life Sciences,1,12,...,2,80,0,10,2,3,9,7,1,8
9,36,0,Travel_Rarely,1299,Research & Development,27,3,Medical,1,13,...,2,80,2,17,3,2,7,7,7,7
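The same encoding can be tried on a small stand-in frame. This sketch uses `map()`, a different pandas method from the `replace()` call used above, and the frame `sample` with its four values is invented for illustration.

```python
import pandas

# Stand-in data; not read from the dataset file
sample = pandas.DataFrame({"Attrition": ["Yes", "No", "No", "Yes"]})
replacement = {"No": 0, "Yes": 1}

# map() looks up each value in the dictionary and returns a new column;
# any value missing from the dictionary would become NaN
sample["Attrition"] = sample["Attrition"].map(replacement)

print(sample["Attrition"].tolist())  # the encoded values
print(sample["Attrition"].mean())    # with 0/1 codes, the mean is the share of 1s
```

One practical benefit of the 0/1 encoding is visible in the last line: the column's mean is the proportion of rows where attrition occurred.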


Some of the data is ordinal.  For example, look at the *BusinessTravel* column.

In [37]:
# display unique values of the BusinessTravel column
data.BusinessTravel.unique().tolist()

['Travel_Rarely', 'Travel_Frequently', 'Non-Travel']

We can order these values by the frequency of travel each one represents: `Non-Travel`, then `Travel_Rarely`, followed by `Travel_Frequently`.  We could replace these text values with numeric values if necessary.
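If numeric values were needed, one possible encoding is sketched below. The codes `0`, `1`, and `2`, the dictionary name `travel_order`, the new column `BusinessTravelCode`, and the stand-in frame are all assumptions for illustration; only the three labels and their order come from the text above.

```python
import pandas

# Stand-in rows using the three labels found in the BusinessTravel column
sample = pandas.DataFrame({
    "BusinessTravel": ["Travel_Rarely", "Travel_Frequently", "Non-Travel", "Travel_Rarely"]
})

# Assumed codes: a larger number means more frequent travel
travel_order = {"Non-Travel": 0, "Travel_Rarely": 1, "Travel_Frequently": 2}

# Keep the original labels and store the codes in a separate column
sample["BusinessTravelCode"] = sample["BusinessTravel"].map(travel_order)

# Sorting by the code ranks rows from least to most travel
ranked = sample.sort_values("BusinessTravelCode")
print(ranked["BusinessTravel"].tolist())
```

Because the data is ordinal, the codes carry rank only: the gap between `0` and `1` is not claimed to equal the gap between `1` and `2`.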