# Chapter 1: Exploratory Data Analysis

Modern data analysis was pioneered by the statistician John Tukey in the 1960s and 1970s. Across these two decades, Tukey published several influential works that laid the foundation for data analysis and data science. Since then, the field of data science has grown rapidly in popularity, driven by advances in technology and greater access to data.

## Elements of Structured Data

___
**Key Terms**
1. Numeric: Data that are expressed on a numeric scale.
2. Continuous: A type of numeric data that can take on any value in an interval. 
3. Discrete: A type of numeric data that can take on only integer values, such as counts.
4. Categorical: Data that can take on only a specific set of values representing a set of possible categories.
5. Binary: A type of categorical data with just two categories of values such as 0/1 or false/true.
6. Ordinal: A type of categorical data that has an explicit ordering. 
___

The Internet of Things (IoT) and other devices have led to a massive influx of new, raw data. Signals from sensors, images, user clicks, and text are just a few examples of the data being generated daily. One of the major challenges of data science is turning this corpus of unstructured data into actionable information. To apply the statistical concepts in this text, raw data must be processed and transformed into a structured form. One of the most common forms of structured data is a table with rows and columns, such as the data stored in a relational database.

The two basic types of structured data are numeric and categorical. Numeric data can be either continuous (speed or time duration) or discrete (number of people or number of events). Categorical data can take on only a limited number of values, such as types of TV screens (plasma, LED, etc.) or the names of states (Alaska, Pennsylvania, etc.). Binary data is a special case of categorical data in which the variable can take on only two possible values, such as 0/1 or true/false. Ordinal data is another type of categorical data in which the categories are ordered, like a numerical rating (1, 2, 3, 4, 5).

The data types in a given dataset are important in deciding which visual displays, analyses, and models to employ. Further, numerical software exploits data types to make statistical packages more efficient. One may ask why categorical data is needed as a data type, since it is ultimately handled as text or numeric data during computation. However, the distinction between categorical and text data is useful for a few reasons:
1. Storing data as categorical can signal to software packages how certain statistical procedures such as producing a chart or fitting a model should behave.
2. Storage and indexing can be optimized.
3. The possible values of a categorical variable can be enforced in the software.
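
The three reasons above can be illustrated with a minimal sketch in *pandas*. The ratings column and its category labels are hypothetical values made up for this example; they do not come from any dataset in this chapter.

```python
import pandas as pd

# Hypothetical ratings column; the values are made up for illustration.
ratings = pd.Series(["low", "high", "medium", "low", "high"])

# Declare the allowed values and their order up front (an ordinal type).
rating_type = pd.CategoricalDtype(categories=["low", "medium", "high"], ordered=True)
ratings_cat = ratings.astype(rating_type)

# 1. Software can use the declared ordering for comparisons and sorting.
print(ratings_cat.min(), ratings_cat.max())
print((ratings_cat >= "medium").tolist())

# 2. Each value is stored as a small integer code rather than a full string.
print(ratings_cat.cat.codes.tolist())

# 3. Values outside the declared categories are not accepted;
#    on conversion they become missing values (NaN).
bad = pd.Series(["low", "excellent"]).astype(rating_type)
print(bad.isna().tolist())
```

Storing text as plain strings would lose all three properties: no ordering, no compact codes, and no check on the allowed values.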

___
**Key Ideas**
1. Data is typically classified in software by type.
2. Data types include numeric (continuous, discrete) and categorical (binary, ordinal).
3. Data types can signal to software how to process the data.
___

## Rectangular Data

___
**Key Terms**
1. Data frame: Rectangular data (like a spreadsheet); the basic data structure for statistical and machine learning models.
2. Feature: A column within a data table. Also may be referred to as attribute, input, predictor, or variable.
3. Outcome: Data science projects often involve predicting an outcome such as yes/no. The features are often used to predict the outcome in an experiment or study. Also may be referred to as dependent variable, response, target, or output.
4. Record: A row within a table. Also referred to as case, instance, observation, pattern, or sample.
___



| Category        | Currency | sellerRating | Duration | endDay | ClosePrice | OpenPrice | Competitive? |
| :---:           |:---:     | :---:        | :---:    | :---:  | :---:      | :---:     | :---:      |
| Music/Movie/Game| US       | 3249         | 5        | Mon    | 0.01       | 0.01      | 0          |
| Music/Movie/Game| US       | 3249         | 5        | Mon    | 0.01       | 0.01      | 0          |
| Automotive      | US       | 3115         | 7        | Tue    | 0.01       | 0.01      | 0          |
| Automotive      | US       | 3115         | 7        | Tue    | 0.01       | 0.01      | 0          |
| Automotive      | US       | 3115         | 7        | Tue    | 0.01       | 0.01      | 0          |
| Automotive      | US       | 3115         | 7        | Tue    | 0.01       | 0.01      | 0          |
| Automotive      | US       | 3115         | 7        | Tue    | 0.01       | 0.01      | 1          |
| Automotive      | US       | 3115         | 7        | Tue    | 0.01       | 0.01      | 1          |

<center>
    Table 1.1: A typical data frame format
</center>

The typical frame of reference for an analysis in data science is rectangular data, such as a spreadsheet or database table. Rectangular data is the general name for data with instances stored in rows and features stored in columns. Data frames are the specific form this rectangular format takes in Python and R. A common task is taking unstructured data and formatting it into a single table, or pulling data from several sources into a single table.

Table 1.1 shows a typical data frame that contains both numeric and categorical data. The data comes from different auctions. The last column, 'Competitive?', is an example of binary data. This feature could act as the target variable for a predictive model that determines whether a given auction was competitive or not.
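
As a rough sketch of how such a table looks in code, the snippet below builds a small *pandas* data frame from three of the rows shown in Table 1.1 and separates the features from the outcome. The variable names `auctions`, `features`, and `outcome` are chosen here for illustration only.

```python
import pandas as pd

# Three rows copied from Table 1.1; each dict key is one feature (column).
auctions = pd.DataFrame({
    "Category": ["Music/Movie/Game", "Automotive", "Automotive"],
    "Currency": ["US", "US", "US"],
    "sellerRating": [3249, 3115, 3115],
    "Duration": [5, 7, 7],
    "endDay": ["Mon", "Tue", "Tue"],
    "ClosePrice": [0.01, 0.01, 0.01],
    "OpenPrice": [0.01, 0.01, 0.01],
    "Competitive?": [0, 0, 1],
})

# Rows are records, columns are features.
print(auctions.shape)
print(auctions.dtypes)

# Split the table into predictors and the binary outcome.
features = auctions.drop(columns="Competitive?")
outcome = auctions["Competitive?"]
```

Keeping the outcome in its own variable mirrors how most modeling libraries expect their inputs: a table of features plus a separate target column.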

Database tables contain one or more columns that act as an index or row number. The Python library *pandas* automatically adds a row index to a data frame when it is loaded. It's possible to add multilevel/hierarchical indices to improve the efficiency of certain operations.
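
The following sketch shows the default row index and one way to build a multilevel index in *pandas*. It reuses a few values from Table 1.1; the choice of `Category` and `endDay` as index levels is an assumption made purely for demonstration.

```python
import pandas as pd

df = pd.DataFrame({
    "Category": ["Music/Movie/Game", "Automotive", "Automotive"],
    "endDay": ["Mon", "Tue", "Tue"],
    "sellerRating": [3249, 3115, 3115],
})

# By default pandas assigns an integer row index 0, 1, 2, ...
print(df.index)

# Build a two-level (hierarchical) index and sort it for efficient lookups.
indexed = df.set_index(["Category", "endDay"]).sort_index()
print(indexed.index.nlevels)

# Select every record under one value of the first index level.
automotive = indexed.loc["Automotive"]
print(automotive)
```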

There are important data types which are inherently not rectangular and must be handled in a different manner. Time series data, such as signals from IoT sensors, are successive measurements of the same variable. This data can be used in forecasting models. Spatial data, used in 3D mapping and location analytics, is more complex than rectangular data. The data is an object, like a building, together with its spatial coordinates. Graph or network data can be used to represent the relationships between entities. For example, a network could represent the friends on a social media website, where users are nodes and a connection between two users indicates friendship.
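
To make two of these structures concrete, here is a minimal sketch of a time series and a friendship graph. The sensor readings, dates, and user names are all hypothetical, and the adjacency-set representation is just one simple way to store a graph.

```python
import pandas as pd

# Time series: successive measurements of one variable, indexed by time.
# The readings and dates are made up for illustration.
readings = pd.Series(
    [20.1, 20.4, 21.0, 20.7],
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)
# Change between each measurement and the previous one.
changes = readings.diff()
print(changes)

# Graph: users are nodes; each user maps to the set of their friends.
friends = {
    "ana": {"ben", "cara"},
    "ben": {"ana"},
    "cara": {"ana"},
}

def are_friends(graph, a, b):
    """Return True if there is a connection from user a to user b."""
    return b in graph.get(a, set())

print(are_friends(friends, "ana", "ben"))
print(are_friends(friends, "ben", "cara"))
```

Neither structure fits naturally into a single table of independent rows: the order of the readings matters, and the graph is defined by links between records rather than by columns.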

___
**Key Ideas**
1. The basic data structure in data science is a data frame with rows (representing records) and columns (representing features).
2. The terminology can be confusing since different fields use different words to refer to the same thing. For instance, a feature can also be referred to as an attribute or input.
___

## Estimates of Location

___
**Key Terms**
1. Mean (or average): Sum of all values divided by the number of values
2. Weighted Mean (or weighted average): Sum of each value times its weight, divided by the sum of the weights.
3. Median (or 50th percentile): The value such that half of the data lies above it and half below it.
4. Percentile (or quantile): The value such that $P$% of the data lies below it.
5. Weighted Median: The value such that half of the sum of the weights lies above it and half below it in the sorted data.
6. Trimmed Mean (or truncated mean): The average of all values after dropping a fixed number of extreme values.
7. Robust (or resistant): Not sensitive to outliers.
8. Outlier (or extreme value): A value that is very different from most of the data.
___

Variables with measured or counted data may have thousands of entries with distinct values. We need a way to get a typical value for such features in a dataset. This measure gives us an estimate of where most of the data is located. The simplest method is to compute the mean value of the feature. However, this may not always be the best way to measure the central value. For this reason, a wide variety of central measures have been developed.

### Mean
The most basic estimate of location is the mean, also known as the average value. For a set of values $x_i$, the mean, denoted $\bar{x}$, is given by
$$\bar{x}=\frac{\sum_{i=1}^{n} x_i}{n}$$
where $n$ is the number of values. The trimmed mean is a variation in which we sort the values, drop a fixed number of the smallest and largest values, and then compute the mean of the remaining values. Writing the sorted values as $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$ and omitting the $p$ smallest and $p$ largest values, the trimmed mean $\bar{x}$ is
$$\bar{x}=\frac{\sum_{i=p+1}^{n-p}x_{(i)}}{n-2p}$$
The trimmed mean eliminates the influence of extreme values on our calculation. Another type of mean is the weighted mean, denoted $x_w$, where each value $x_i$ is multiplied by a corresponding weight $w_i$.
$$x_w=\frac{\sum_{i=1}^{n}w_ix_i}{\sum_{i=1}^{n}w_i}$$
There are two reasons we may want to use a weighted mean. The first is that some values are inherently more variable than others, so lower weights are assigned to the highly variable values. For instance, one sensor in a network may be less accurate than the others and as such be assigned a low weight. The other reason is that the gathered data may not equally represent the different groups that we are interested in measuring. For instance, in political polling it is quite difficult to get a sample of respondents which matches the country's voters demographically. As such, responses are weighted based on demographics such as age, race, and gender to better match the demographics of likely voters.
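
The three formulas above can be checked with a short sketch. The numbers are made up for illustration (one deliberately extreme value, and three hypothetical sensor readings with assumed weights); `scipy.stats.trim_mean` and `numpy.average` are used only to cross-check the manual calculations.

```python
import numpy as np
from scipy.stats import trim_mean

# Made-up values; 57 is a deliberately extreme value.
values = np.array([2, 3, 3, 4, 5, 5, 6, 7, 8, 57], dtype=float)
n = len(values)

# Mean: sum of all values divided by the number of values.
mean = values.sum() / n

# Trimmed mean: sort, drop the p smallest and p largest values, then average.
p = 1
sorted_values = np.sort(values)
trimmed = sorted_values[p:n - p].sum() / (n - 2 * p)

# scipy's trim_mean takes the proportion to cut from EACH end (here 1 of 10).
trimmed_scipy = trim_mean(values, 0.1)

# Weighted mean: hypothetical sensor readings, the last sensor trusted less.
x = np.array([10.0, 12.0, 20.0])
w = np.array([2.0, 2.0, 1.0])
weighted = (w * x).sum() / w.sum()
weighted_np = np.average(x, weights=w)

print(mean, trimmed, trimmed_scipy, weighted, weighted_np)
```

Dropping just one value from each end moves the estimate from 10 down to about 5, which shows how strongly a single extreme value can pull the ordinary mean.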

### Median and Robust Estimates
The median is the middle number in a sorted list of data. If there is an even number of values in the data, then the median is the average of the two middle values. The median can be particularly useful when the data has a large range of values. We can also compute a weighted median which, like the median, is robust to outliers. Instead of the middle value, the weighted median is the value such that the sum of the weights is equal for the lower and upper halves of the sorted list.

An outlier is any value that is very distant from the other values in a dataset. The exact definition is somewhat subjective. Outliers can be valid values, but they are often generated by data errors (for instance, mixing feet and meter values) or faulty sensors. Outliers can lead to a misleading mean value, but the median is typically still a valid estimate of location. Outliers should be identified and investigated based on the dataset and problem at hand.

In typical data analysis, outliers are sometimes informative and sometimes a nuisance. However, if the outliers themselves are the focus of the analysis, the task is referred to as anomaly detection. In this case, the mass of regular values is used to define what constitutes an outlier.
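
A small sketch can show both the robustness of the median and one possible way to compute a weighted median. The helper `weighted_median` is a hypothetical function written for this example, not a library routine; it returns the first sorted value at which the running sum of weights reaches half of the total weight, which is only one of several conventions for handling ties. The data values are made up.

```python
import numpy as np

def weighted_median(values, weights):
    """First sorted value whose cumulative weight reaches half the total weight."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cumulative = np.cumsum(weights)
    half = 0.5 * cumulative[-1]
    # Index of the first position where cumulative >= half.
    idx = np.searchsorted(cumulative, half)
    return values[idx]

# Made-up data with one extreme value (1000).
data = [1, 2, 3, 4, 1000]

print(np.mean(data))    # pulled far upward by the outlier
print(np.median(data))  # unaffected by how extreme the outlier is

# With equal weights the weighted median agrees with the ordinary median.
print(weighted_median(data, [1, 1, 1, 1, 1]))

# Putting most of the weight on the smallest value moves the result toward it.
print(weighted_median(data, [5, 1, 1, 1, 1]))
```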

### Example: Location Estimates of Population and Murder Rates
In the next few cells, we calculate some location estimates using the population and murder rates of US states from the 2010 census.

In [3]:
from scipy.stats import trim_mean
import numpy as np
import pandas as pd

#loading the dataset from local storage and showing a few rows
state= pd.read_csv('data/state.csv')
state.head()

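| Unnamed: 0 | State      | Population | Murder.Rate | Abbreviation |
| :---:      | :---:      | :---:      | :---:       | :---:        |
| 0          | Alabama    | 4779736    | 5.7         | AL           |
| 1          | Alaska     | 710231     | 5.6         | AK           |
| 2          | Arizona    | 6392017    | 4.7         | AZ           |
| 3          | Arkansas   | 2915918    | 5.6         | AR           |
| 4          | California | 37253956   | 4.4         | CA           |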


In [4]:
#calculating the mean, trimmed mean, and median values of the population column
#trim_mean(..., 0.1) drops the lowest 10% and the highest 10% of the sorted values before averaging

meanPop = state['Population'].mean()
trimMeanPop = trim_mean(state['Population'], 0.1)
medianPop = state['Population'].median()

print('Mean Population=', meanPop, '----- Trimmed Mean Population=', trimMeanPop, '----- Median Population=', medianPop)

Mean Population= 6162876.3 ----- Trimmed Mean Population= 4783697.125 ----- Median Population= 4436369.5
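
The same idea can be applied to the murder rate, where weighting by population accounts for the different sizes of the states. The sketch below is only an illustration: it uses the five rows displayed by `state.head()` above rather than the full dataset, so its results are not the national figures. The variable names are chosen here for the example.

```python
import numpy as np
import pandas as pd

# Only the five rows shown by state.head() above, NOT the full dataset.
sample = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Arizona", "Arkansas", "California"],
    "Population": [4779736, 710231, 6392017, 2915918, 37253956],
    "Murder.Rate": [5.7, 5.6, 4.7, 5.6, 4.4],
})

# Plain mean: every state counts equally, regardless of size.
mean_rate = sample["Murder.Rate"].mean()

# Weighted mean: each state's rate is weighted by its population.
weighted_rate = np.average(sample["Murder.Rate"], weights=sample["Population"])

print('Mean Murder Rate=', mean_rate, '----- Weighted Mean Murder Rate=', weighted_rate)
```

In this small sample the weighted mean comes out lower than the plain mean, because California has by far the largest population and the lowest rate of the five states.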
