# [Practical Statistics for Data Scientists](https://www.oreilly.com/library/view/practical-statistics-for/9781491952955/)
by Peter Bruce, Andrew Bruce, & Peter Gedeck

## Chapter 1 - Exploratory Data Analysis

- Inference - set of procedures for drawing conclusions about a large population from small samples.

### Elements of Structured Data

- Data Types
  - Numeric - data expressed on a numeric scale.
    - Continuous - can take any value within an interval (interval, float, decimal)
    - Discrete - integer values (integer, count)
  - Categorical - can only take specific values (enums, factors, nominal)
    - Binary - 0 or 1, true or false
    - Ordinal - explicit ordering (ordered factor)
    - Categorical types are useful because they tell software how to process the data, allow for better storage and indexing, and enforce possible values.
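
The benefits of categorical types can be seen directly in pandas. The following is a minimal sketch on a made-up `sizes` column; the labels and the `size_type` name are illustrative assumptions, not taken from the book.

```python
import pandas as pd

# Hypothetical toy column; the size labels are illustrative only.
sizes = pd.Series(["small", "large", "medium", "small"])

# Ordinal data: the categories carry an explicit order.
size_type = pd.CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
ordinal = sizes.astype(size_type)

# Storage: each value is kept as a small integer code.
codes = ordinal.cat.codes.tolist()

# Processing: comparisons respect the declared order.
at_least_medium = (ordinal >= "medium").tolist()

# Enforcement: a value outside the declared categories becomes missing (NaN).
unknown_is_missing = pd.Series(["huge"]).astype(size_type).isna().tolist()

print(codes)
print(at_least_medium)
print(unknown_is_missing)
```

Declaring the order once in the dtype means sorting, comparisons, and `min`/`max` all follow the intended ranking rather than alphabetical order.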

### Rectangular Data
General term for a two-dimensional matrix with rows (records) and columns (features), such as a spreadsheet or database table.

- Data frame - Standard term for the basic data structure used in statistical and machine learning models.
  - Typically has one or more columns that act as an index.
- Feature - A column within a table (attribute, input, predictor, variable).
- Outcome - The value a project aims to predict; features are sometimes used to predict it (dependent variable, response, target, output).
- Record - A row within a table (case, example, instance, observation, pattern, sample)
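
These terms map onto a pandas data frame as sketched below. The three rows reuse values from the first rows of the NBA table shown later in these notes; treating `game_result` as the outcome is an assumption made only for illustration.

```python
import pandas as pd

# Three records copied from the first rows of the NBA table in these notes.
games = pd.DataFrame(
    {
        "team_id": ["TRH", "NYK", "CHS"],
        "pts": [66, 68, 63],
        "opp_pts": [68, 66, 47],
        "game_result": ["L", "W", "W"],
    }
)

features = games[["pts", "opp_pts"]]  # feature columns (predictors)
outcome = games["game_result"]        # outcome column (assumed target for this sketch)
record = games.iloc[0]                # a single record (row)

print(features.shape)
print(outcome.tolist())
print(record["team_id"])
```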

### Nonrectangular Data Structures
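These are more specialized structures.

- Time series - successive measurements of the same variable over time; the raw material for forecasting and a key part of IoT device data
- Spatial data structures - spatial coordinates, used in mapping and location analytics
- Graph (or network) - represents physical, social, and abstract relationships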

Initialize the Python environment and import the dataset:

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as sp

In [2]:
pd.set_option("display.max_columns", None)
pd.set_option("display.precision", 2)

In [3]:
df = pd.read_csv("data/nba_all_elo.csv", parse_dates=[5],
    dtype={
        "lg_id": "category",
        "_iscopy": "category",
        "seasongame": "category",
        "is_playoffs": "category",
        "team_id": "category",
        "fran_id": "category",
        "opp_id": "category",
        "opp_fran": "category",
        "game_location": "category",
        "game_result": "category"
    }
)
df.head()

Unnamed: 0,gameorder,game_id,lg_id,_iscopy,year_id,date_game,seasongame,is_playoffs,team_id,fran_id,pts,elo_i,elo_n,win_equiv,opp_id,opp_fran,opp_pts,opp_elo_i,opp_elo_n,game_location,game_result,forecast,notes
0,1,194611010TRH,NBA,0,1947,1946-11-01,1,0,TRH,Huskies,66,1300.0,1293.28,40.29,NYK,Knicks,68,1300.0,1306.72,H,L,0.64,
1,1,194611010TRH,NBA,1,1947,1946-11-01,1,0,NYK,Knicks,68,1300.0,1306.72,41.71,TRH,Huskies,66,1300.0,1293.28,A,W,0.36,
2,2,194611020CHS,NBA,0,1947,1946-11-02,1,0,CHS,Stags,63,1300.0,1309.65,42.01,NYK,Knicks,47,1306.72,1297.07,H,W,0.63,
3,2,194611020CHS,NBA,1,1947,1946-11-02,2,0,NYK,Knicks,47,1306.72,1297.07,40.69,CHS,Stags,63,1300.0,1309.65,A,L,0.37,
4,3,194611020DTF,NBA,0,1947,1946-11-02,1,0,DTF,Falcons,33,1300.0,1279.62,38.86,WSC,Capitols,50,1300.0,1320.38,H,L,0.64,


### Estimates of Location
A "typical value" for each variable (column), an estimate of where most of the data is located.

- Statisticians use the term *estimate* for a value calculated from a data set.
- Data scientists and business analysts use the term *metric*, which is more concrete and tied to a practical use.
- Statisticians estimate, data scientists measure.

#### Mean (average)
The sum of all values divided by the number of values.
- Sensitive to every value in the data, so a few extreme outliers can shift it substantially.
- `values.sum() / values.count()`
- $\bar{x} = \frac{\sum_{i=1}^n x_i}{n}$
- $\bar{x}$ is pronounced "x-bar"

In [4]:
pts = df["pts"]
print("manual:", pts.sum() / pts.count())
print("mean():", pts.mean())

manual: 102.72998242475101
mean(): 102.72998242475101


#### Weighted Mean (weighted average)
The sum of each value multiplied by its weight, divided by the sum of the weights. Two common motivations:
- Some values are intrinsically more variable than others. These highly variable values should be given lower weights.
  - e.g. Readings from a less accurate sensor should be given a lower weight.
- The data collected doesn't equally represent the different groups being measured, so underrepresented groups can be given higher weights.
- `([value] * [weights]).sum() / [weights].sum()`
- $\bar{x}_w = \frac{\sum_{i=1}^n w_i x_i}{\sum_{i=1}^n w_i}$

In [5]:
# add a weight column that weights years less the farther back in time they go
bins = [1900, 1950, 1960, 1970, 1980, 1990, 2000, 2010, 2025]
weights = [0.1, 0.2, 0.4, 0.5, 0.7, 0.8, 0.9, 1.0]
df["weight"] = pd.cut(df["year_id"], bins=bins, labels=weights, ordered=False).astype(float)

# calc weighted mean from new weight column
w_mean = (df["pts"] * df["weight"]).sum() / df["weight"].sum()
print("weighted mean pts:", w_mean)

weighted mean pts: 102.2616841745508
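
The formula is easy to check by hand on a tiny example. This sketch uses made-up values and weights (both are assumptions for illustration) and cross-checks the manual calculation against `np.average`, which accepts a `weights` argument.

```python
import numpy as np

# Made-up toy data: the last value counts twice as much as the others.
values = [10.0, 20.0, 30.0]
weights = [1.0, 1.0, 2.0]

# Manual: (1*10 + 1*20 + 2*30) / (1 + 1 + 2)
manual = sum(w * x for w, x in zip(weights, values)) / sum(weights)

# The same calculation with NumPy.
with_numpy = float(np.average(values, weights=weights))

# The unweighted mean, for comparison.
plain = sum(values) / len(values)

print("weighted:", manual)
print("numpy:", with_numpy)
print("unweighted:", plain)
```

The weighted result sits closer to 30 than the unweighted mean does, because the heaviest weight is attached to the largest value.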


#### Trimmed Mean (truncated mean)
The mean after dropping a fixed number of sorted values from each end.
- Less sensitive to extreme outliers than the ordinary mean.
- `([values].sort_values()[trim : n - trim]).mean()`
- $\bar{x}_{(r)} = \frac{1}{n - 2r} \sum_{i = r+1}^{n-r} x_{(i)}$
- $x_{(i)}$ is the $i$-th smallest value and $r$ is the number of values dropped from each end.

In [6]:
pts_sorted = df["pts"].sort_values()
n = len(pts_sorted)
trim = int(0.1 * n) # number of values to drop from each end (10%)
trimmed = pts_sorted.iloc[trim:n-trim]
print("trimmed mean pts:", trimmed.mean())

# using scipy
print("trimmed with scipy:", sp.trim_mean(df["pts"], 0.1))

trimmed mean pts: 102.71580968214384
trimmed with scipy: 102.71580968214384
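
On the NBA points the trimmed and ordinary means are nearly identical because the data has no extreme outliers. The effect is clearer on a small made-up list with one extreme value; the numbers below are assumptions chosen only to make the arithmetic easy to follow.

```python
# Made-up toy data with a single extreme outlier.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
r = 1  # number of values to drop from each end

ordered = sorted(values)
kept = ordered[r:len(ordered) - r]  # drops 1 and 1000

plain_mean = sum(values) / len(values)
trimmed_mean = sum(kept) / len(kept)

print("mean:", plain_mean)
print("trimmed mean:", trimmed_mean)
```

Dropping one value from each end removes the outlier, so the trimmed mean reflects the bulk of the data while the ordinary mean is dominated by the single value of 1000.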


#### Median (50th percentile)
The value such that one-half of the sorted data lies above it and one-half below.
- Depends only on the values in the center of the sorted data, so it is much less affected by outliers than the mean. Such an estimate is referred to as *robust*.

In [7]:
print("median:", df["pts"].median())

median: 103.0
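
Robustness can be illustrated with a small made-up list (the numbers are assumptions, not from the dataset): replacing the largest value with an extreme one changes the mean but leaves the median untouched.

```python
import statistics

# Made-up toy data, before and after the largest value becomes an outlier.
typical = [30, 32, 35, 40, 45]
with_outlier = [30, 32, 35, 40, 500]

median_before = statistics.median(typical)
median_after = statistics.median(with_outlier)

mean_before = statistics.mean(typical)
mean_after = statistics.mean(with_outlier)

print("median:", median_before, "->", median_after)
print("mean:", mean_before, "->", mean_after)
```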


#### Weighted Median
The value in the sorted data such that one-half of the sum of the weights lies below it and one-half above.

In [8]:
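# sort by points so that each weight stays paired with its value
sorted_df = df.sort_values("pts")
weights_cs = sorted_df["weight"].cumsum() # cumulative sum in sorted order
cutoff = 0.5 * sorted_df["weight"].sum()
# first sorted value at which the cumulative weight reaches half the total
median_value = sorted_df.loc[weights_cs >= cutoff, "pts"].iloc[0]
print("weighted pts median:", median_value)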
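
The definition can also be sketched as a small standalone function on toy data. The `weighted_median` helper and the sample numbers are hypothetical, and the sketch assumes the convention of returning the first sorted value whose cumulative weight reaches half the total (other conventions interpolate between neighbors). The key step is sorting values and weights together so each weight stays attached to its value.

```python
def weighted_median(values, weights):
    """Return the first sorted value whose cumulative weight reaches half the total."""
    pairs = sorted(zip(values, weights))  # sort by value, keeping weights paired
    cutoff = 0.5 * sum(weights)
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= cutoff:
            return value
    raise ValueError("values must not be empty")

# Made-up toy data, deliberately unsorted.
values = [3, 1, 5, 2, 4]

equal = weighted_median(values, [1.0, 1.0, 1.0, 1.0, 1.0])
heavy_top = weighted_median(values, [1.0, 1.0, 6.0, 1.0, 1.0])  # weight 6 belongs to value 5

print("equal weights:", equal)
print("heavy weight on 5:", heavy_top)
```

With equal weights the result matches the ordinary median; putting most of the weight on the largest value pulls the weighted median up to it.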


#### Percentile (quantile)
The value such that P percent of the data is less than or equal to it and (100 - P) percent is greater than or equal to it.
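
A minimal sketch with NumPy on made-up data (the values are assumptions for illustration). `np.percentile` takes P on a 0 to 100 scale and interpolates linearly between data points by default, so results can differ slightly from other percentile definitions; pandas' `Series.quantile` expresses the same idea with a fraction between 0 and 1.

```python
import numpy as np

# Made-up toy data, evenly spaced so the percentiles land on data points.
data = [10, 20, 30, 40, 50]

quartiles = np.percentile(data, [25, 50, 75])
median = np.median(data)

print("25th, 50th, 75th:", quartiles.tolist())
print("median:", median)
```

The 50th percentile and the median are the same quantity, which is why the median is also described as the 50th percentile.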