# Tutorial 2 (Week 2) - Descriptive Statistics and Hypothesis Testing

## Learning Objectives

After completing this tutorial, you should be able to:

+ Manipulate NumPy and Pandas data structures for statistical computation
  + Group the dataset by variable values
  + Filter the dataset for specific variable values
+ Compute descriptive statistics for a dataset
  + Compute statistics on arrays, Series, and DataFrame
  + Apply statistical measures for decision making
+ Fit a probability distribution to a dataset and estimate the parameters using SciPy
+ Perform hypothesis testing using SciPy

# Preface: Handling Data in Pandas and NumPy

In Tutorial 1, we used Pandas DataFrames and NumPy arrays to create visualizations. We will now look at these data structures in more detail so that we can perform more advanced operations on our data.

_Tip:_ Throughout the tutorial, we will encounter various functions and properties in Pandas and NumPy. Make it a habit to look up the documentation (API reference and usage examples) of those you are not yet familiar with. This way, you form a better understanding of how you can use them in the future, beyond the example problems in this tutorial.

In Jupyter Notebook, the references are conveniently linked under the `Help` menu.

In [1]:
import numpy as np
import pandas as pd

## NumPy Arrays

NumPy is a fundamental package for scientific computing in Python. The main object in NumPy is `ndarray`, also known by the alias `array`. 

A NumPy `array` is a table of elements (usually numbers), which is:
- _Homogeneous_: elements are all of the same type;
- _Multi-dimensional_: elements can be arranged along one or more __axes__;
- _Indexed_: elements are addressable by a tuple of integers, one on each axis.
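
As a small illustration of these three properties, the sketch below builds a made-up array `m` (not part of the tutorial data) and inspects it:

```python
import numpy as np

# Mixing integers and a float: NumPy upcasts every element to one common type
m = np.array([[1, 2, 3], [4, 5, 6.5]])

print(m.dtype)   # float64: all elements share the same type
print(m.ndim)    # 2: the elements are arranged along two axes
print(m[1, 2])   # 6.5: one integer per axis addresses a single element
```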

The following image ([source](https://predictivehacks.com/tips-about-numpy-arrays/)) illustrates the array structure.

<img src="https://predictivehacks.com/wp-content/uploads/2020/08/numpy_arrays-1024x572.png" width="500">

Note that `numpy.array` is not the same as the Standard Python Library class `array.array`, which only handles one-dimensional arrays.

Sample basic operations on NumPy arrays are given below. Try running the code and make sure you understand the output. You can go to [NumPy quickstart](https://numpy.org/doc/stable/user/quickstart.html) for more examples and practice.

There are various ways to __create arrays__.

In [2]:
# One of many ways to create an array
a = np.arange(24).reshape(4, 3, 2)
a

array([[[ 0,  1],
        [ 2,  3],
        [ 4,  5]],

       [[ 6,  7],
        [ 8,  9],
        [10, 11]],

       [[12, 13],
        [14, 15],
        [16, 17]],

       [[18, 19],
        [20, 21],
        [22, 23]]])

Take note of the array notation, and compare the above array display with the conceptual illustration to understand how it is represented in NumPy.

Some __properties:__

In [3]:
a.shape

(4, 3, 2)

In [4]:
a.ndim

3

In [5]:
a.size

24

There are several functions to create arrays with __initialized content__.

In [6]:
np.zeros( (3,5) )

array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])

In [7]:
np.full( (3,5), 0.25 )

array([[0.25, 0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25, 0.25]])

In [None]:
np.random.rand( 3, 5 )

__Indexing__ operation retrieves the array element at `index`. For multi-dimensional arrays, `index` is a comma-separated tuple with one component for each axis. A negative component means counting from the last element on the axis.

In [None]:
a[3,2,1]

In [None]:
a[3,-2,1]

__Slicing__ operation takes a range `start:stop` and returns a contiguous subset of array elements from `start` (_inclusive_) to `stop` (_exclusive_). Leaving `start` blank means slicing from the first element, while leaving `stop` blank means slicing until and including the last element.

The slicer can take an optional third argument, making it `start:stop:step`, to set the interval at which elements are included in the slice. A negative step reverses the direction of the stepping.

In [None]:
a[3,1:2,1]

In [None]:
a[3,1:,1]

In [None]:
a[::2,:,1]
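
The same slicing rules may be easier to follow on a one-dimensional array. The sketch below uses a hypothetical array `v`, chosen only for illustration:

```python
import numpy as np

v = np.arange(10)        # the integers 0 to 9

first_three = v[:3]      # start omitted: begins at the first element
from_seven = v[7:]       # stop omitted: runs through the last element
every_third = v[1:8:3]   # elements at positions 1, 4, 7
reversed_v = v[::-1]     # negative step walks the axis backwards

print(first_three, from_seven, every_third, reversed_v)
```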

Conceptually, indexing returns an _element_ of the array, while slicing returns a _subset_ of the array. 

- A subset of an array is always another array. 

- An element of a 1D array is simply a value; an element of a 2D array is a 1D array, and so on. 

The indexing operation essentially narrows down to the element axis by axis, from axis 0 upwards. 

We can omit index components for higher axes (that is, stop narrowing down at one point) to retrieve all elements on remaining axes. This is equivalent to getting a complete slice of all remaining axes.

In [8]:
a[3,2]

array([22, 23])

In [9]:
a[3,2,:]

array([22, 23])

In [None]:
a[3]

In [None]:
a[3,:,:]

__Try this out:__ 

- Can you omit index components for lower axes (while specifying one or more higher axes index)? What elements does this retrieve? How is it different from slicing?

- Can you omit index components for arbitrary axes (while specifying other axes index)? What elements does this retrieve? How is it different from slicing?

In [None]:
# Try!

__Arithmetic operations__ on arrays are applied _element-wise_.

In [None]:
a*2

In [None]:
a < 7
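
A comparison such as `a < 7` produces a boolean array with the same shape as the original. One common use, sketched here on a small hypothetical array `b`, is as a mask that counts or selects the matching elements:

```python
import numpy as np

b = np.arange(12).reshape(3, 4)

mask = b < 7            # element-wise comparison, shape (3, 4)
n_small = mask.sum()    # True counts as 1, so this counts the matches
small = b[mask]         # boolean indexing returns the matches as a 1D array

print(n_small)
print(small)
```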

## Pandas Series and DataFrame

The Pandas representation of a data table is the __DataFrame__, a 2-dimensional data structure that can store data of different types in columns.

<img src = "https://pandas.pydata.org/pandas-docs/stable/_images/01_table_dataframe.svg" width="360">

Each column in a DataFrame is a __Series__, a one-dimensional labeled array consisting of _index_ (the axis label) and data values. A Series object has a single data type, which can be any supported [`dtype`](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dtypes) (integers, strings, Python objects, etc.).

<img src = "https://pandas.pydata.org/pandas-docs/stable/_images/01_table_series.svg" width="120">

Sample basic operations on Series and DataFrame are given below. Try running the code and make sure you understand the output. You can refer to the [Pandas User Guide on data structures](https://pandas.pydata.org/docs/user_guide/dsintro.html) for a more comprehensive guide.


### Series Operations

There are various ways to __create a Series__, such as from a NumPy array. We can specify the index or leave it as the default integer-based index.

In [None]:
s = pd.Series( np.random.rand(5), index=["a", "b", "c", "d", "e"] )
s

The concepts of indexing, slicing, and element-wise arithmetic operations also apply to Series, with some differences in application compared to NumPy arrays. Various [indexing methods](https://pandas.pydata.org/docs/user_guide/indexing.html) are supported.

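__Indexing__ may use the index label or, through `Series.iloc`, the integer position.

In [None]:
s['a']

In [None]:
s.iloc[0]  # positional access; plain s[0] is deprecated when the index holds labels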

The `Series.get()` method returns `None` instead of raising an error when the label does not exist.

In [None]:
s.get('d')

In [None]:
s.get('f')  # What happens?

In [None]:
print( s.get('f') )
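
`Series.get()` also accepts a fallback value through its `default` parameter, which can be handy when `None` is not a convenient result. A minimal sketch on an illustrative Series `t`:

```python
import pandas as pd

t = pd.Series([1.5, 2.5], index=["a", "b"])

present = t.get("a")          # label exists: returns the stored value
missing = t.get("z")          # label missing: returns None, no KeyError
fallback = t.get("z", 0.0)    # label missing: returns the given default

print(present, missing, fallback)
```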

__Slicing__ with a range is applied on the Series data as well as the index.

In [None]:
s[2:4]

In [None]:
s[::2]  # with step

As Series has index labels, slicing can be done using labels as well, but the behaviour is different. Try this out.

In [None]:
s['c':'e']

Beyond contiguous ranges, __arbitrary non-contiguous selection__ is possible: we can specify a list of labels.

In [None]:
# Note the inner [] for list notation
s[['a','b','d']]

We can also pass a boolean Series to pick elements corresponding to `True`-valued labels.

In [None]:
print( "Median:", s.median() )

# The > operation is applied element-wise, resulting in a Series of boolean values
s[ s > s.median() ]
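
Boolean Series can also be combined with `&` (and), `|` (or), and `~` (not). Each condition needs its own parentheses because these operators bind more tightly than comparisons. The Series `scores` below is a made-up example:

```python
import pandas as pd

scores = pd.Series([3, 8, 5, 9, 1], index=list("abcde"))

# Keep values that are above 2 AND below 9
middle = scores[(scores > 2) & (scores < 9)]

# Keep values that are NOT above 2
low = scores[~(scores > 2)]

print(middle)
print(low)
```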

### DataFrame Operations

Most of the time, the DataFrames we work with are the results of loading actual datasets. There are also other ways of __creating a DataFrame__, such as from a dict of Series.

In [None]:
d = {
    "one": pd.Series( np.random.rand(5), index=list('abcde') ),
    "two": pd.Series( ['Alice', 'Bob', 'Carol'], index=list('abc') )  # different length
}
# Observe how the different Series lengths are handled
df = pd.DataFrame(d)
df

__Select DataFrame columns__ by column name. The result is a Series, retaining the index. The column name is stored in the `name` attribute of the Series.

In [None]:
df["one"]

__Select DataFrame rows__ by the row label using `df.loc`, or by the integer location using `df.iloc`. The result is also a Series, with the column names serving as index. The row label is stored in the `name` attribute of the Series.

In [None]:
df.loc['c']

In [None]:
df.iloc[2]
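
Both accessors also accept a second component that selects a column, in which case a single value is returned rather than a Series. The sketch below uses a hypothetical DataFrame `frame`:

```python
import pandas as pd

frame = pd.DataFrame(
    {"x": [10, 20, 30], "y": ["p", "q", "r"]},
    index=["a", "b", "c"],
)

row = frame.loc["b"]             # one row as a Series; row.name is "b"
by_label = frame.loc["b", "x"]   # row label and column name
by_position = frame.iloc[1, 0]   # row position and column position

print(row.name, by_label, by_position)
```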

We can __add a new column__ to the DataFrame, e.g., from a Series. 

New columns are added at the end by default. `DataFrame.insert()` can be used to insert a column at a particular location.

In [None]:
df["flag"] = pd.Series( np.ones(5), index=list('abcde') )
df

In [None]:
df.insert( 2, "three", df["flag"] - df["one"] )
df

We can also __set the value of an existing column__.

In [None]:
df["two"] = ['Alice', 'Bob', 'Carol', 'Dale', 'Eva']
df

__Delete a column__ with the Python `del` statement. To remove the column but keep the data as a separate Series, __pop the column__ using `DataFrame.pop` instead.

In [None]:
del df["flag"]
df

In [None]:
names = df.pop( "two" )

print( "Names:\n", names )
df

__`DataFrame.assign()`__ is a useful method to create new columns (potentially derived from existing columns) in a copy of the data, leaving the original DataFrame untouched.

In [None]:
dfcopy = df.assign( four = df["one"] * df["three"] )
dfcopy

In [None]:
df

We can __rename columns__ using a mapping. The index (row labels) can similarly be renamed. Note that the rename operation returns a new DataFrame.

In [None]:
dfcopy = dfcopy.rename( columns={"three" : "factor", "four" : "product"} )  # overwrite the existing DataFrame
dfcopy
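
Renaming row labels works the same way through the `index` parameter. A minimal sketch on a throwaway DataFrame `original` (the labels are illustrative):

```python
import pandas as pd

original = pd.DataFrame({"one": [1, 2]}, index=["a", "b"])

# rename() returns a new DataFrame; labels missing from the mapping are kept
renamed = original.rename(index={"a": "first"}, columns={"one": "uno"})

print(renamed)
print(original)   # unchanged
```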

## Interoperability of Pandas and NumPy Data Structures

Most NumPy functions can be called directly on Series and DataFrame.

In [None]:
np.exp( s )

In [None]:
np.square( df )

As we have seen in Tutorial 1, however, some functions require NumPy arrays. We can __convert a Series or a DataFrame into a NumPy array__ using the `Series.to_numpy` or `DataFrame.to_numpy` methods respectively. With heterogeneous data, the resulting array uses the lowest common type that can hold all the values.

In [None]:
s.to_numpy()

In [None]:
df.to_numpy()
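
To see what "lowest common type" means in practice, the sketch below converts two small hypothetical DataFrames and prints the `dtype` of each result:

```python
import pandas as pd

numeric = pd.DataFrame({"i": [1, 2], "f": [0.5, 1.5]})
mixed = pd.DataFrame({"i": [1, 2], "name": ["Alice", "Bob"]})

# Integer and float columns share a floating-point type
numeric_dtype = numeric.to_numpy().dtype

# Numbers and strings only share the generic object type
mixed_dtype = mixed.to_numpy().dtype

print(numeric_dtype, mixed_dtype)
```

An `object` array stores generic Python objects, so many numeric NumPy functions will not work on it as they do on a numeric array.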

Note that DataFrame is not intended to be a drop-in replacement for NumPy array as its indexing semantics and data model are quite different in places from an n-dimensional array.

In situations when we can use either Pandas or NumPy functions, we may consider factors such as the speed or memory consumption, as summarized in this [Pandas vs Numpy comparison table](https://www.knowledgehut.com/blog/data-science/pandas-vs-numpy#pandas-vs-numpy-[comparison-table]).

# Introduction to SciPy

The SciPy package contains various toolboxes dedicated to common issues in scientific computing. 

Although there are basic statistical functions (mean, mode, etc.) that can be applied directly to [Series](https://pandas.pydata.org/docs/reference/series.html#computations-descriptive-stats), [DataFrames](https://pandas.pydata.org/docs/reference/frame.html#computations-descriptive-stats), and [NumPy arrays](https://numpy.org/devdocs/reference/routines.statistics.html), the real repository for statistical functions is in [`scipy.stats`](https://docs.scipy.org/doc/scipy/reference/stats.html).

Let us work with a dataset next, to see how we can use the various packages to compute the statistics. We will also use some Matplotlib plotting functions along the way, so let's import that too.

In [None]:
import scipy.stats as stats

%matplotlib inline
import matplotlib.pyplot as plt

# Descriptive Statistics

For this tutorial, we will use the [food delivery time dataset](https://www.kaggle.com/datasets/bhanupratapbiswas/food-delivery-time-prediction-case-study).

__EXERCISE:__ 

Load the dataset from the file `food_delivery_time.xlsx`. What Pandas function should you use to read the Excel format?

In [None]:
# TODO

__EXERCISE:__ 

Rename the column names to make them easier to handle:
```
Delivery_person_*   --> Rider_*
Delivery_location_* --> Customer_*
Type_of_order       --> Order_Type
Type_of_vehicle     --> Vehicle
Time_taken(min)     --> Time
```

In [None]:
# TODO

Upon inspection, the values in the `Order_Type` and `Vehicle` columns contain trailing whitespace, which may hinder our work later on.

In [None]:
delivery["Order_Type"][0]

In [None]:
delivery["Vehicle"][0]

__EXERCISE:__ 

Trim the whitespace from all values in these two columns and store them back in the same DataFrame. Look up the method of the `Series.str` accessor that you can use for this.

In [None]:
# TODO

In [None]:
delivery["Order_Type"][0]  # check after replacement

In [None]:
delivery["Vehicle"][0]  # check after replacement

__EXERCISE:__ 

What is the quickest way to find out (1) age range of all Riders, (2) mean Ratings across all orders, and (3) median of delivery time, directly on this DataFrame?

_Hint:_ You have done a similar task in Tutorial 1.

In [None]:
# TODO

The `DataFrame.describe` function is handy to quickly obtain basic statistics for all _numeric_ variables. These statistics are generated by excluding the missing values in the data, and the `count` value shows how many values in the column are used in the computation. As such, `count` is also a good indicator of the presence of missing values. 

Observe the output above and check whether this dataset has any missing values. (We will see how to deal with this situation when we learn Data Preprocessing later on in this course.)

Of course, we can obtain the standalone statistical measures (count, mean, std, and so on) as well. You can refer to the documentation for these functions.
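
To see how `count` reveals missing values, consider a tiny made-up DataFrame `toy` (not the delivery dataset) in which one value is missing:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "age": [25, 31, np.nan, 40],      # one missing value
    "rating": [4.5, 4.8, 4.1, 4.9],   # complete
})

summary = toy.describe()
print(summary.loc["count"])   # age: 3.0, rating: 4.0

# isna().sum() counts the missing values per column directly
print(toy.isna().sum())
```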

__EXERCISE:__

Using standalone functions, obtain the answer to the three questions above.

In [None]:
# TODO: Find the age range of all Riders

In [None]:
# TODO: Find the mean Ratings across all orders

In [None]:
# TODO: Find the median of delivery time

__Categorical variables__ are excluded from `DataFrame.describe`. As we would expect, measures like mean or median are not sensible for such variables. 

What statistics apply to categorical variables? These are count-related measures such as frequencies and proportion.

__EXERCISE:__ 

Find out the most popular order type in this dataset.

In [None]:
# TODO

Check if your answer is correct by comparing it with the counts of all order types below. 

Suppose we want to optimize the delivery operations. Looking at all the counts, is the mode useful to make a decision on which order type to focus on?

In [None]:
delivery["Order_Type"].value_counts()

`scipy.stats` has a `describe` function as well, which works on _a single numeric variable_ and gives a slightly different set of statistical measures.

In [None]:
stats.describe( delivery["Rider_Ratings"] )

## Grouping and Filtering

Most of the time, statistics over the whole dataset do not give us a lot of meaningful information. For instance, not many useful conclusions can be drawn just by knowing the mean delivery time of all orders, given the variation in delivery distances and delivery vehicles. We would typically want to filter or group the dataset by values or conditions on the variables.

### Groupby

We can use the `DataFrame.groupby()` function to split the dataset based on the values of a specific variable. This is typically done to obtain aggregate statistics for the resulting groups (as we want to do here), or to use other functionality in the _split-apply-combine_ framework for data analysis, as explained in [this Pandas guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html). 

The following image ([source](https://www.justintodata.com/pandas-groupby-with-python/)) illustrates the groupby mechanism.

<img src="https://www.justintodata.com/wp-content/uploads/2020/04/image-4.png">

Note that no actual splitting is done when the GroupBy object is created. The function only verifies that we have passed a valid mapping. The splitting is only done when we explicitly use some method on this object or extract some of its attributes, such as `groups`.

In [None]:
delivery.groupby( ["Order_Type", "Vehicle"] ).groups

We can also view data in a specific group using `get_group`.

In [None]:
delivery.groupby( ["Order_Type", "Vehicle"] ).get_group( ("Drinks", "motorcycle") )

As mentioned, however, we typically want to apply some operation to the GroupBy object.
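
A minimal sketch of split-apply-combine on a made-up DataFrame `toy` (the column names are hypothetical and unrelated to the delivery data):

```python
import pandas as pd

toy = pd.DataFrame({
    "team": ["red", "blue", "red", "blue", "red"],
    "score": [10, 20, 30, 60, 50],
})

# Split by team, apply mean and count to each group, combine into one table
per_team = toy.groupby("team")["score"].agg(["mean", "count"])
print(per_team)
```

Selecting the `score` column before aggregating keeps the result limited to the variable of interest.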

__EXERCISE:__ 

Find the median delivery times for this dataset, grouped by vehicle types. 

In [None]:
# TODO

### Filtering Rows

Another useful operation is filtering rows, which is usually done as an indexing operation as we learned previously. We pass, as the index, a condition on the values in a certain column, e.g.:

- `delivery["Vehicle"] == "scooter"` or
- `delivery["Rider_Age"] > 25`

These are comparison expressions which will be evaluated element-wise on the columns `delivery["Vehicle"]` or `delivery["Rider_Age"]`, resulting in a Series of boolean values: `True` for rows fulfilling the condition and `False` otherwise. Indexing `delivery` using this boolean Series will select only the rows corresponding to `True` values.
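
The two steps (build the boolean Series, then index with it) can be sketched on a small made-up DataFrame `toy`:

```python
import pandas as pd

toy = pd.DataFrame({
    "colour": ["red", "blue", "red", "green"],
    "size": [3, 7, 9, 4],
})

is_red = toy["colour"] == "red"   # boolean Series, one value per row
red_rows = toy[is_red]            # keeps only the rows marked True

print(is_red.tolist())
print(red_rows)
```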

__EXERCISE:__ 

An order just came in and we are looking for riders on scooters to assign it to. How many riders can we choose from?

_Hint:_ After filtering, what is the column we should be looking at? What Series function can you use to obtain the unique count?

In [None]:
# TODO

__EXERCISE:__ 

How many of these scooter riders have achieved ratings higher than the median of this group?

In [None]:
# TODO

__EXERCISE:__ 

What is the highest rating that scooter riders have achieved? Show the IDs of all scooter riders who have achieved it.

In [None]:
# TODO

As a final exercise in this section, suppose we pick any scooter rider who has achieved ratings higher than the median to deliver all orders. Consider how we may satisfactorily estimate a delivery time to inform the customers.

__EXERCISE:__

First, let's try to obtain the summary statistics for the delivery time of this group.

In [None]:
# TODO

If the mean and median (the 50th percentile) values are close, we might think either would be a reasonable estimate to provide. But what will happen if we give this estimate to all customers? Clearly about half the customers would be upset because they would not receive their food within this time.

So should we take the maximum delivery time, and provide a guaranteed delivery time of `max`? We may soon lose many customers as most people would not want to wait that long for their food.

Instead, looking at the delivery time distribution, using a histogram, may guide our estimate better.

__EXERCISE:__

Use the `Series.hist` function to draw the histogram of delivery time.

In [None]:
# TODO

Using the histogram, we can verify the time duration in which most orders can be delivered. On the rare occasions that it takes longer, it could be because the restaurants are overcrowded, or the traffic jams are worse than usual.

We could then determine a reasonable time range to provide to customers, and inform them that it might take longer in certain unexpected situations.
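
One possible way to turn a reading of the histogram into a number is to quote a high percentile instead of the median. The sketch below uses simulated, right-skewed times rather than the tutorial dataset, and the 90% level is an arbitrary choice for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated delivery times in minutes (an assumption, not the real data)
times = rng.gamma(shape=9.0, scale=3.0, size=1000)

typical = np.median(times)           # about half of the orders take longer
quoted = np.percentile(times, 90)    # about 10% of the orders take longer

print(round(typical, 1), round(quoted, 1))
```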

# Probability Distributions with SciPy

A _probability distribution_ describes phenomena that are influenced by random processes: naturally occurring random processes; or uncertainties caused by incomplete knowledge.

A _random variable, X_, represents the outcome of a random process. The _distribution function_ assigns probabilities to the values that X can take.

SciPy provides 104 continuous and 19 discrete distributions (the exact numbers vary by SciPy version), implemented as subclasses of its `stats.rv_continuous` and `stats.rv_discrete` classes. Discrete distributions deal with countable outcomes, such as customers arriving at a counter. Continuous distributions give the probability that an outcome falls between two points on the x-axis, such as variations in height, temperature, or time.
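
As a brief sketch of the two kinds of distribution, the example below creates one discrete and one continuous distribution object. The parameter values (an arrival rate of 3, heights with mean 170 and standard deviation 10) are assumptions made up for illustration:

```python
import scipy.stats as stats

# Discrete: customers arriving per minute, assumed Poisson with rate 3
arrivals = stats.poisson(mu=3)
p_exactly_two = arrivals.pmf(2)    # P(X = 2)

# Continuous: heights in cm, assumed normal with mean 170 and std 10
heights = stats.norm(loc=170, scale=10)
p_between = heights.cdf(180) - heights.cdf(160)   # P(160 <= X <= 180)

print(p_exactly_two, p_between)
```

For a discrete variable, the probability of a single outcome is meaningful (`pmf`); for a continuous variable, probabilities are obtained over an interval, here as a difference of two `cdf` values.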

Distributions used in engineering and technology (for instance, to model the lifetime or time to failure of equipment), as well as in biology and pharmaceutics, have proliferated in recent years, driven by the rapidly increasing availability of sensor data and other large sources of quantifiable information.

## Normal Distribution

The normal distribution, also called the Gaussian distribution, is a continuous probability distribution that is symmetric around its mean. It is arguably the most famous distribution due to its mathematical properties and its ability to describe many natural phenomena. It is often a good model for roughly symmetric population data, for example, the distribution of adult heights. 

The normal distribution is characterized by two parameters: 
- the _mean (μ)_, which represents the central tendency of the distribution; and
- the _standard deviation (σ)_, which measures the spread or dispersion of the data.

By knowing these two parameters, we can fully describe a normal distribution.

### Distribution Fitting

Fitting a normal distribution to a dataset allows us to estimate these parameters from the dataset. The `norm.fit` function in `scipy.stats` takes an array-like object as input and returns the maximum likelihood estimates (MLE) for the mean and standard deviation of the underlying distribution.
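
A minimal sketch of `norm.fit` on simulated data (the sample below is generated with an assumed mean of 30 and standard deviation of 5, and is not the delivery dataset):

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=30.0, scale=5.0, size=500)   # simulated data

mu, sigma = stats.norm.fit(sample)

# The MLE of the mean is the sample mean; the MLE of the standard
# deviation divides by n (ddof=0), not by n - 1
print(mu, np.mean(sample))
print(sigma, np.std(sample, ddof=0))
```

Note that the fitted `sigma` matches `np.std` with its default `ddof=0`, whereas Pandas `Series.std` defaults to `ddof=1` and would give a slightly larger value.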

Let us fit a normal distribution to our observed dataset of delivery time. To use the SciPy function, we need this data in NumPy array format.

__EXERCISE:__

Convert the delivery time data into a NumPy array. 

In [None]:
# TODO

__EXERCISE:__

Now use the `norm.fit` function to fit the normal distribution to the data. Note that it returns two values, corresponding to the two parameters.

In [None]:
# TODO

The mean is an estimator of the center of the distribution. The normal distribution fitted by SciPy should have the same center as the sample mean, because the maximum likelihood estimate of μ is the sample mean.

__EXERCISE:__

Compute the mean of the sample array directly using NumPy, and verify that it has the same center.

In [None]:
# TODO

# Hypothesis Testing using SciPy

A statistical test is a decision indicator. For instance, if we have two sets of observations that we assume are generated from Gaussian processes, we can use a t-test to decide whether the means of the two sets are significantly different. 

Suppose we want to determine whether the average delivery time using motorcycles is significantly different from a population mean delivery time of 26 minutes.

__EXERCISE:__

Obtain the NumPy array containing delivery time data for motorcycles.

In [None]:
# TODO

Consider the null hypothesis that the expected value (mean) of the motorcycle delivery time samples is equal to the given population mean. The `stats.ttest_1samp` function performs a two-sided test by default and returns the t-statistic and p-value.
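
The decision rule can be sketched on simulated data. The sample, the hypothesised mean of 50, and the variable names below are all invented for illustration:

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=52.0, scale=4.0, size=200)   # simulated values

popmean = 50.0   # hypothesised population mean
alpha = 0.05     # significance level

t_stat, p_value = stats.ttest_1samp(sample, popmean)

# Reject the null hypothesis when the p-value falls below alpha
reject = p_value < alpha
print(t_stat, p_value, reject)
```

A positive t-statistic indicates that the sample mean lies above the hypothesised mean; the p-value tells us how surprising a difference of this size would be if the null hypothesis were true.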

__EXERCISE:__

Run the `ttest_1samp` function on the motorcycle delivery time data and the given population mean. Let the significance level be 0.05. Can we reject the null hypothesis?

In [None]:
# TODO

Now let us investigate whether the mean delivery time using motorcycles is significantly different from the mean delivery time using bicycles.

__EXERCISE:__

Obtain the NumPy array containing delivery time data for bicycles.

In [None]:
# TODO

We can use the `stats.ttest_ind` function to do a two-sided test for the null hypothesis that the two independent samples have identical average (expected) values.
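
A sketch of the two-sample test on simulated data follows. The two groups are generated with different assumed means (50 and 55), so a small p-value is expected:

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(2)
group_a = rng.normal(loc=50.0, scale=5.0, size=150)   # simulated sample
group_b = rng.normal(loc=55.0, scale=5.0, size=120)   # higher true mean

alpha = 0.05
t_stat, p_value = stats.ttest_ind(group_a, group_b)
reject = p_value < alpha
print(t_stat, p_value, reject)

# Welch's variant does not assume equal variances in the two groups
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)
print(t_welch, p_welch)
```

By default `ttest_ind` assumes the two populations have equal variances; passing `equal_var=False` is a common choice when that assumption is doubtful.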

__EXERCISE:__

Run the `ttest_ind` function on the motorcycle delivery time data and the bicycle delivery time data. Let the significance level be 0.05. Can we reject the null hypothesis?

In [None]:
# TODO