In [None]:
# running this on Ubuntu
# !sudo apt install python3-pip
# !pip install fastai

## Prerequisites

### Median and Mode in Statistics

Let's break down these two statistical measures:

1. **Median**: The median is the middle value in a dataset when the numbers are sorted in ascending or descending order. If there is an odd number of observations, the median is the middle number. If there is an even number of observations, the median is the average of the two middle numbers. The median can be a more useful measure than the mean (average) when dealing with data that is skewed or has extreme values, because a few outliers barely move it, whereas they can shift the mean considerably.

2. **Mode**: The mode is the value that appears most frequently in a dataset. A dataset may have one mode, more than one mode, or no mode at all. For example, in the dataset {2, 3, 4, 4, 5}, the mode is 4 because it appears twice, more than any other number. If the data set is {2, 3, 4, 4, 5, 5}, it has two modes (4 and 5), making it a bimodal dataset. In a dataset where no number repeats, such as {1, 2, 3, 4, 5}, there is no mode.

In summary, the median and mode are both measures of central tendency, used to understand the distribution of a dataset, but they measure different aspects. The median gives us a mid-point of the data, while the mode indicates the most frequently occurring value(s).
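
As a small illustration of the definitions above, here is a sketch using Python's standard `statistics` module on the example datasets from the text. One caveat: for a dataset with no repeated value, `statistics.multimode` returns every value (each occurs once), so the "no mode" case has to be detected explicitly; the `has_mode` check below is just one possible way to do that.

```python
import statistics

odd_data = [5, 3, 4, 2, 4]        # the dataset {2, 3, 4, 4, 5}, in arbitrary order
even_data = [2, 3, 4, 4, 5, 5]

# median() sorts internally; for an even count it averages the two middle values
median_odd = statistics.median(odd_data)
median_even = statistics.median(even_data)

# multimode() returns every value that shares the highest frequency
modes_single = statistics.multimode(odd_data)
modes_bimodal = statistics.multimode(even_data)

# When no value repeats, every value is tied, so we treat that as "no mode"
no_repeat = [1, 2, 3, 4, 5]
has_mode = len(statistics.multimode(no_repeat)) < len(set(no_repeat))

print(median_odd, median_even, modes_single, modes_bimodal, has_mode)
```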

#### practical use-case of median

Let's consider a real-world scenario in which the median is a more meaningful measure of central tendency: house prices in a neighborhood.

Suppose you're a real estate agent or a home buyer interested in the typical price of houses in a particular neighborhood. The homes in this neighborhood vary greatly in price due to differences in size, age, condition, and proximity to local amenities.

Here's why the median would be particularly useful in this scenario:

1. Let's say there are 101 houses in this neighborhood. To find the median price, you would list all the house prices in ascending order and take the middle value, which here is the 51st price. With an even number of houses, the median would be the average of the two middle prices.

2. The median gives you the middle point of house prices, which means that half of the houses are priced below this point and half are priced above it. This can be very useful information for a prospective buyer or a real estate agent to understand the distribution of house prices in that neighborhood.

3. One of the advantages of the median in this context is that it is not affected by extremely high or low values (outliers). For instance, if most houses are priced between $200,000 and $300,000, but a few luxury homes are priced at $2,000,000, the mean (average) house price would be significantly higher than what a typical house in the neighborhood costs. However, the median would not be skewed by these few high-end homes, providing a more accurate picture of what a typical buyer might expect to pay.

So in this scenario, using the median rather than the mean would provide a more meaningful representation of the central tendency of house prices.
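
To make the effect of outliers concrete, here is a minimal sketch with made-up prices (the figures are hypothetical and only chosen to mirror the scenario above): seven typical homes and one luxury home.

```python
import statistics

# Hypothetical prices: seven typical homes plus one luxury home
prices = [210_000, 225_000, 240_000, 250_000,
          260_000, 275_000, 290_000, 2_000_000]

mean_price = statistics.mean(prices)      # pulled upwards by the luxury home
median_price = statistics.median(prices)  # average of the two middle prices

print(f"mean:   {mean_price:,.0f}")
print(f"median: {median_price:,.0f}")
```

With these numbers, the mean lands well above what any of the seven typical homes costs, while the median stays inside the typical price range.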

#### practical use-case of mode

Let's consider an example involving a shoe store to demonstrate a scenario where the mode is meaningful.

Suppose you are a store manager for a shoe store. Understanding your sales and customers' preferences is crucial for managing your inventory effectively. 

One simple piece of data you might be interested in is the most commonly sold shoe size. This is where the mode becomes useful. 

Here's why:

1. If you look at your sales data for the past year, you might have sold shoes in sizes ranging from 5 to 13. To find the mode, you would identify which shoe size appears most frequently in your sales data.

2. Suppose you find that size 9 is the mode - it appears more frequently than any other size. This tells you that size 9 shoes are the most commonly sold in your store.

3. This information can help guide your inventory decisions. For example, you might want to make sure you order more size 9 shoes than other sizes to meet the demand.

In this scenario, the mode provides a meaningful insight that can directly impact business decisions. It's worth noting that while the mean or median shoe size might also be interesting, they would not provide the same level of practical, actionable insight.
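
A possible way to find the most commonly sold size is to count occurrences with `collections.Counter`. The sales list below is invented for the sake of the example.

```python
from collections import Counter

# Hypothetical sizes of the shoes sold
sizes_sold = [9, 8, 10, 9, 7, 9, 11, 8, 9, 10, 12, 9, 8]

# Count how often each size was sold
size_counts = Counter(sizes_sold)

# most_common(1) returns a list with the single most frequent (size, count) pair
most_common_size, units_sold = size_counts.most_common(1)[0]

print(f"size {most_common_size} was sold {units_sold} times")
```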


In [None]:
from fastai.tabular.all import *

# download a sample of the [Census Income Data Set](https://archive.ics.uci.edu/ml/datasets/Adult)
# using a predefined constant in fastai
path = untar_data(URLs.ADULT_SAMPLE)
# showing, where on the machine the data is stored
path.ls()

In [None]:
# having a look at a sample of the data
df = pd.read_csv(path/'adult.csv')
df.head()


In [None]:
# each value of the categorical variables of this dataset (such as workclass, education, marital-status, etc.)
# will be mapped to a unique index,
# whereas continuous fields (like age, for instance) will be treated as simple float numbers;
# this is what we specify to our data loader

# `y_names` is the name of the dependent variable (the one we want to predict)
dls = TabularDataLoaders.from_csv(path/'adult.csv', path=path, y_names="salary",
    cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
    cont_names = ['age', 'fnlwgt', 'education-num'],
    procs = [Categorify, FillMissing, Normalize])

In the above code, `procs` is a list of data preprocessing steps to be applied to the input data before training the model. 

- `Categorify` is a preprocessing step that converts categorical variables into numerical representations.
- `FillMissing` is a preprocessing step that fills in missing values in the dataset.
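  - in fastai, this applies to the continuous fields: by default, a missing value is replaced with the median of its column (the mode or a constant can be chosen instead), and a boolean column is added to record which rows were filled. Missing values in categorical fields are handled by `Categorify`, which assigns them a dedicated `#na#` category.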
- `Normalize` is a preprocessing step that normalizes the continuous variables in the dataset to have a mean of 0 and a standard deviation of 1.
  - "Normalization" in the context of data preprocessing is a technique used to standardize the range of independent variables or features of data (e.g. things that are difficult to compare in place because of their different scales). This can make the dataset easier to work with and can help machine learning algorithms perform better.
  - Normalizing a dataset to have a mean of 0 and a standard deviation of 1 is often called "standardization" or "z-score normalization".
    - Mean of 0: When we say that the normalized data has a mean of 0, we mean that if you add up all the values and divide by the number of values (which is how you calculate the mean), the result will be 0. Essentially, the positive and negative numbers balance each other out.
    - Standard deviation of 1: The standard deviation is a measure of how spread out the numbers in the data are. If the data is tightly clustered around the mean, the standard deviation is small, and if the data is spread out over a large range of values, the standard deviation is large. When we say the normalized data has a standard deviation of 1, we're saying that each value has been divided by the column's original standard deviation, so that values are now expressed in units of "standard deviations away from the mean". (Subtracting the mean is what makes the data `centered`; dividing by the standard deviation is what rescales it.)
  - The reason we do this is to put different variables on an equal footing before we run a machine learning algorithm. For example, if you have one variable that is in the range of 1 to 10 and another that is in the range of 1 to 1,000,000, the algorithm might end up giving too much weight to the larger variable simply because of its scale. By standardizing both variables to have a mean of 0 and standard deviation of 1, we ensure they both can have an equal impact on the algorithm's result. In other words, normalization is a way to make different types of data comparable, so that no single type of data overpowers the others when we're trying to find patterns or make predictions.
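
The standardization described above can be sketched in a few lines of plain Python. This is only an illustration of the formula `z = (x - mean) / std`, not fastai's implementation; it assumes the population standard deviation, and the two feature lists are made up.

```python
import statistics

def standardize(values):
    """Return z-scores: subtract the mean, then divide by the standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (an assumption)
    return [(v - mean) / std for v in values]

# Two hypothetical features on very different scales
small_scale = [1, 4, 5, 6, 10]
large_scale = [100_000, 400_000, 500_000, 600_000, 1_000_000]

z_small = standardize(small_scale)
z_large = standardize(large_scale)

# After standardization both features have mean 0 and standard deviation 1
print([round(z, 3) for z in z_small])
print([round(z, 3) for z in z_large])
```

Because the second feature is just the first one multiplied by a constant, both end up with the same z-scores, which is the sense in which standardization puts them on an equal footing.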

By applying these preprocessing steps, the data is transformed into a format that can be used by the model for training.
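
To give a feel for the first two steps, here is a conceptual sketch in plain Python. It is not fastai's code: the helper names `categorify` and `fill_missing_with_median` are hypothetical, and the sketch assumes that index 0 is reserved for missing categories and that filled rows are flagged with a boolean.

```python
import statistics

NA_LABEL = "#na#"  # placeholder category for missing values

def categorify(values):
    """Map each category to an integer index; index 0 is kept for missing values."""
    vocab = [NA_LABEL] + sorted({v for v in values if v is not None})
    index_of = {label: i for i, label in enumerate(vocab)}
    codes = [index_of[NA_LABEL if v is None else v] for v in values]
    return codes, vocab

def fill_missing_with_median(values):
    """Replace None with the median of the known values and flag the filled rows."""
    median = statistics.median(v for v in values if v is not None)
    filled = [median if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return filled, was_missing

# Hypothetical columns with one missing value each
workclass = ["Private", "State-gov", None, "Private"]
education_num = [13, None, 9, 10]

codes, vocab = categorify(workclass)
filled, was_missing = fill_missing_with_median(education_num)

print(vocab, codes)
print(filled, was_missing)
```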

In [None]:
# now, we'll do the same thing, but with an explicit split of the data into training and validation sets

# `splits` holds a random 80/20 split of the row indices;
# the fixed seed makes the split reproducible (useful for debugging the training)
splits = RandomSplitter(valid_pct=0.2, seed=42)(range_of(df))

# the `TabularPandas` class applies the preprocessing steps and the split
to = TabularPandas(df, procs=[Categorify, FillMissing, Normalize],
                   cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
                   cont_names = ['age', 'fnlwgt', 'education-num'],
                   y_names='salary',
                   splits=splits)

# check out how the data is processed now (returning the first 2 rows)
to.xs.iloc[:2]
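
Conceptually, a random 80/20 split shuffles the row indices and cuts the shuffled list in two. The sketch below illustrates that idea in plain Python; `random_split` is a hypothetical helper written for this example, not the fastai function, and its details (such as which part of the shuffled list becomes the validation set) are assumptions.

```python
import random

def random_split(n_items, valid_pct=0.2, seed=None):
    """Shuffle the indices 0..n_items-1 and keep the first `valid_pct` share for validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)   # a fixed seed gives the same shuffle every time
    cut = int(valid_pct * n_items)
    return indices[cut:], indices[:cut]    # (training indices, validation indices)

train_idx, valid_idx = random_split(10, valid_pct=0.2, seed=42)

print("train:", train_idx)
print("valid:", valid_idx)
```

Calling the function again with the same seed returns the same two lists, which is what makes a seeded split convenient when debugging.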

In [None]:
# this is our new data loader;
# we fix the random seed first (useful for the reproducibility of the training results)
set_seed(42)
dls = to.dataloaders(bs=64)

dls.show_batch()

In [None]:
# this learner will infer the loss function based on the earlier specified `y_names` variable
learn = tabular_learner(dls, metrics=accuracy)

learn.fit_one_cycle(1)
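
The `accuracy` metric used above is, in essence, the share of validation rows whose predicted class matches the actual one. A minimal sketch of that idea, with a hypothetical helper `accuracy_score` and made-up labels:

```python
def accuracy_score(predicted_labels, actual_labels):
    """Share of rows where the predicted label equals the actual label."""
    correct = sum(p == a for p, a in zip(predicted_labels, actual_labels))
    return correct / len(actual_labels)

# Hypothetical labels for five rows
predicted = [">=50k", "<50k", "<50k", ">=50k", "<50k"]
actual = [">=50k", "<50k", ">=50k", ">=50k", "<50k"]

score = accuracy_score(predicted, actual)
print(f"accuracy: {score:.2f}")
```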

In [None]:
# let's have a look at the prediction for a given row of the dataset
row, clas, probs = learn.predict(df.iloc[0])
row.show()

In [None]:
# let's see what's in the `clas` and `probs` variables
# `clas` is the predicted class, given as an index into the class labels (here, `<50k` or `>=50k`)
# `probs` holds the predicted probability of each class
clas, probs
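
As a rough illustration of how the two values relate: the predicted class is the index of the largest probability. The numbers and the label order below are hypothetical and stand in for the tensors returned by the learner.

```python
# Hypothetical values, in the order used by the class labels
vocab = ["<50k", ">=50k"]
probs = [0.35, 0.65]

# The predicted class is the index with the highest probability
predicted_index = max(range(len(probs)), key=lambda i: probs[i])
predicted_label = vocab[predicted_index]

print(predicted_index, predicted_label)
```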

In [None]:
# getting predictions on a new data frame by copying the original one
# and also dropping the `salary` column (which is the one we want to predict)
# to be able to make predictions without the learner knowing the actual value
test_df = df.copy()
# `axis=1` means we're dropping a column (and not a row)
test_df.drop(['salary'], axis=1, inplace=True)
dl = learn.dls.test_dl(test_df)
learn.get_preds(dl=dl)