# Welcome to WQD7003 Data Analytics Lab
This notebook was prepared for the WQD7003 module.

Created by Shier Nee Saw

Reference: Python for Data Analysis, O'Reilly

# Summarizing and Computing Descriptive Statistics

pandas objects are equipped with a set of common mathematical and statistical methods.

Most of these fall into the category of reductions or summary statistics: methods that extract a single value (like the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame.

Unlike the equivalent methods of plain NumPy arrays, they exclude missing data by default.
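
To make the contrast with NumPy concrete, here is a minimal sketch that runs the same reduction on the same values with both libraries. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

values = [1.0, np.nan, 3.0]

# A plain NumPy reduction propagates NaN
numpy_sum = np.sum(np.array(values))

# NumPy needs a dedicated nan-aware function to skip missing values
numpy_nansum = np.nansum(np.array(values))

# pandas reductions skip NaN by default (skipna=True)
pandas_sum = pd.Series(values).sum()

print(numpy_sum, numpy_nansum, pandas_sum)
```

The plain NumPy sum is NaN, while `np.nansum` and the pandas `sum` both ignore the missing entry.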


In [None]:
import pandas as pd
import numpy as np

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


In [None]:
# Calling DataFrame’s sum method returns a Series containing column sums

df.sum()

one    9.25
two   -5.80
dtype: float64

In [None]:
# Passing axis=1 sums over the rows instead

df.sum(axis=1)

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

In [None]:
# NA values are excluded unless the entire slice (row or column in this case) is NA. This
# can be disabled using the skipna option

df.mean(axis=1, skipna=False)

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64

In [None]:
df.mean(axis=1, skipna=True)

a    1.400
b    1.300
c      NaN
d   -0.275
dtype: float64

In [None]:
# Some methods, like idxmin and idxmax, return indirect statistics like the index value
# where the minimum or maximum values are attained:

df.idxmax()

one    b
two    d
dtype: object

In [None]:
# calculate the cumulative sum

df.cumsum()

Unnamed: 0,one,two
a,1.4,
b,8.5,-4.5
c,,
d,9.25,-5.8


In [None]:
# describe produces multiple summary statistics in one shot

df.describe()

Unnamed: 0,one,two
count,3.0,2.0
mean,3.083333,-2.9
std,3.493685,2.262742
min,0.75,-4.5
25%,1.075,-3.7
50%,1.4,-2.9
75%,4.25,-2.1
max,7.1,-1.3


In [None]:
# if you have non-numeric data, describe produces alternative summary statistics.

obj = pd.Series(['a', 'a', 'b', 'c'] * 4)

obj.describe()


count     16
unique     3
top        a
freq       8
dtype: object

### Other methods

*  count
*  describe
*  min, max
*  argmin, argmax
*  idxmin, idxmax
*  quantile
*  sum
*  mean
*  median
*  mad (removed in pandas 2.0)
*  var
*  std
*  skew
*  kurt
*  cumsum
*  cummin, cummax
*  cumprod
*  diff
*  pct_change

Reference: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html
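
A few of the methods listed above are not demonstrated elsewhere in this notebook. The sketch below tries some of them on a small made-up Series; the data and variable names are assumptions for illustration.

```python
import pandas as pd

s = pd.Series([10.0, 12.0, 9.0, 15.0], index=['a', 'b', 'c', 'd'])

median = s.median()          # middle value of the sorted data
q75 = s.quantile(0.75)       # 75th percentile (linear interpolation by default)
step = s.diff()              # difference from the previous element
growth = s.pct_change()      # fractional change from the previous element
running_max = s.cummax()     # largest value seen so far at each position

print(median, q75)
print(pd.DataFrame({'diff': step, 'pct_change': growth, 'cummax': running_max}))
```

Note that `diff` and `pct_change` return NaN in the first position because there is no previous element to compare with.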

# Correlation and Covariance

Some summary statistics, like correlation and covariance, are computed from pairs of arguments.
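
As a minimal sketch of what "pairs of arguments" means, the example below computes the covariance and Pearson correlation of two Series and checks that the correlation equals the covariance divided by the product of the two standard deviations. The data is made up for illustration.

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.1, 3.9, 6.2, 7.8, 10.1])

cov_xy = x.cov(y)     # sample covariance of the pair (x, y)
corr_xy = x.corr(y)   # Pearson correlation of the pair (x, y)

# Pearson correlation is the covariance rescaled by both standard deviations
manual_corr = cov_xy / (x.std() * y.std())

print(cov_xy, corr_xy, manual_corr)
```

Because the correlation is rescaled, it always lies between -1 and 1, whereas the covariance depends on the units of the two variables.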

In [None]:
# What counts as a good correlation depends on the application, but as a rough rule of thumb
# an absolute value of at least 0.6 is often treated as a strong correlation.

df = {
    "Array_1": [30, 70, 100],
    "Array_2": [65.1, 49.50, 30.7]
}

data = pd.DataFrame(df)

print(data.corr())

          Array_1   Array_2
Array_1  1.000000 -0.990773
Array_2 -0.990773  1.000000


In [None]:
# Create a DataFrame from the CSV file
# Download nba.csv from the ODL platform.
# Upload it to Google Colab: refer to https://www.youtube.com/watch?v=I9zT-dC4Lw8&ab_channel=Dr.Vipin%27sClassroom
# After you have uploaded the file, run this cell

df = pd.read_csv("nba.csv")

# Printing the first 10 rows of the data frame for visualization
df[:10]

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0
5,Amir Johnson,Boston Celtics,90.0,PF,29.0,6-9,240.0,,12000000.0
6,Jordan Mickey,Boston Celtics,55.0,PF,21.0,6-8,235.0,LSU,1170960.0
7,Kelly Olynyk,Boston Celtics,41.0,C,25.0,7-0,238.0,Gonzaga,2165160.0
8,Terry Rozier,Boston Celtics,12.0,PG,22.0,6-2,190.0,Louisville,1824360.0
9,Marcus Smart,Boston Celtics,36.0,PG,22.0,6-4,220.0,Oklahoma State,3431040.0


Here we use the `corr()` method to compute pairwise correlations among the columns of the DataFrame using the 'pearson' method.

The DataFrame has only four numeric columns. In the result, each cell is the correlation between the row variable and the column variable.

It's worth noting that the correlation of a variable with itself is always 1.00, hence all diagonal values are 1.00.

In [None]:
# Correlation among the numerical columns using pearson method

df[['Number', 'Age', 'Weight', 'Salary']].corr(method='pearson')

Unnamed: 0,Number,Age,Weight,Salary
Number,1.0,0.028724,0.206921,-0.112386
Age,0.028724,1.0,0.087183,0.213459
Weight,0.206921,0.087183,1.0,0.138321
Salary,-0.112386,0.213459,0.138321,1.0
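
The `method` argument of `corr()` also accepts `'spearman'`, which correlates ranks instead of raw values. The sketch below, on made-up data, shows how the two can differ when a relationship is monotonic but not linear.

```python
import pandas as pd

frame = pd.DataFrame({'x': [1, 2, 3, 4, 5]})
frame['y'] = frame['x'] ** 3   # monotonic but nonlinear in x

# Pearson measures linear association, so it falls short of 1 here
pearson = frame.corr(method='pearson').loc['x', 'y']

# Spearman works on ranks, so any strictly increasing relationship scores 1
spearman = frame.corr(method='spearman').loc['x', 'y']

print(pearson, spearman)
```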


In [None]:
# Covariance among the numerical columns

df[['Number', 'Age', 'Weight', 'Salary']].cov()

Unnamed: 0,Number,Age,Weight,Salary
Number,254.916,2.019722,87.11377,-9418112.0
Age,2.019722,19.39536,10.12422,4910243.0
Weight,87.11377,10.12422,695.2895,18920380.0
Salary,-9418112.0,4910243.0,18920380.0,27344930000000.0


# Unique values, value counts and membership


In [None]:
# return the unique values in Series
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

obj

0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

In [None]:
uniques = obj.unique()
uniques

array(['c', 'a', 'd', 'b'], dtype=object)

In [None]:
# count the occurrences of each unique value, sorted in descending order of count

obj.value_counts()

c    3
a    3
b    2
d    1
Name: count, dtype: int64

In [None]:
# if we do not want the counts sorted, pass sort=False
# (the top-level pd.value_counts function is deprecated in recent pandas versions)

obj.value_counts(sort=False)

c    3
a    3
d    1
b    2
Name: count, dtype: int64

In [None]:
# isin is responsible for vectorized set membership and can be very useful in
# filtering a data set down to a subset of values in a Series or column in a DataFrame

mask = obj.isin(['b', 'c'])
mask

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

In [None]:
# keep only the values where the mask is True

obj[mask]

0    c
5    b
6    b
7    c
8    c
dtype: object
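
The same pattern filters DataFrame rows by the values in one column. The sketch below uses a small hypothetical table (the names and positions are invented, not taken from nba.csv) and also shows how `~` inverts the mask.

```python
import pandas as pd

# Hypothetical rows for illustration only
players = pd.DataFrame({
    'Name': ['P1', 'P2', 'P3', 'P4'],
    'Position': ['PG', 'SF', 'C', 'PG'],
})

is_guard = players['Position'].isin(['PG', 'SG'])

guards = players[is_guard]        # rows whose Position is in the list
non_guards = players[~is_guard]   # ~ negates the boolean mask

print(guards)
print(non_guards)
```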

In [None]:
# In some cases, you may want to compute a histogram on multiple related columns in
# a DataFrame.

data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})

data

Unnamed: 0,Qu1,Qu2,Qu3
0,1,2,1
1,3,3,5
2,4,1,2
3,3,2,4
4,4,3,4


In [None]:
# Count how often each unique value (1 to 5) occurs in each column.
# A value that never occurs in a column gets NaN, which we fill with zero.
# Passing pd.Series.value_counts to the DataFrame's apply method does this:

result = data.apply(pd.Series.value_counts).fillna(0)
result

Unnamed: 0,Qu1,Qu2,Qu3
1,1.0,1.0,1.0
2,0.0,2.0,1.0
3,2.0,2.0,0.0
4,2.0,0.0,2.0
5,0.0,0.0,1.0


# Handling Missing Data

Missing data is common in most data analysis applications. One of the goals in designing pandas was to make working with missing data as painless as possible.

* dropna - Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.
* fillna - Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'.
* isnull - Return like-type object containing boolean values indicating which values are missing / NA.
* notnull - Negation of isnull.
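
As a quick orientation, the sketch below applies all four methods to one small made-up DataFrame. A common first step with a new dataset is `isnull().sum()`, which counts the missing values in each column.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'x': [1.0, np.nan, 3.0],
                      'y': [np.nan, np.nan, 6.0]})

missing_per_column = frame.isnull().sum()    # number of NA values per column
present_per_column = frame.notnull().sum()   # number of non-NA values per column
complete_rows = frame.dropna()               # rows without any NA
filled = frame.fillna(0)                     # NA replaced by a constant

print(missing_per_column)
print(complete_rows)
```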

In [None]:
# pandas uses the floating-point value NaN (Not a Number) to represent missing data in
# both floating-point and non-floating-point arrays

string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])

string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [None]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

In [None]:
# The built-in Python None value is also treated as NA in object arrays
string_data[0] = None

string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool

In [None]:
print(string_data)

print('------------')
print(string_data.dropna())  # this does not change the original string_data; it returns a new Series

print('------------')
print(string_data)  # notice that string_data still contains the null values

0         None
1    artichoke
2          NaN
3      avocado
dtype: object
------------
1    artichoke
3      avocado
dtype: object
------------
0         None
1    artichoke
2          NaN
3      avocado
dtype: object


In [None]:
# to change the original data by dropping na - perform assignment
print(string_data)

print('------------')
string_data = string_data.dropna()

print(string_data)  # notice that string_data no longer contains null values

0         None
1    artichoke
2          NaN
3      avocado
dtype: object
------------
1    artichoke
3      avocado
dtype: object


# Filtering Out Missing Data

You have a number of options for filtering out missing data. While doing it by hand is always an option, dropna can be very helpful.

On a Series, it returns the Series with only the non-null data and index values:

In [None]:
data = pd.Series([1, np.nan, 3.5, np.nan, 7])

data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [None]:
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

With DataFrame objects, things are a bit more complex. You may want to drop rows
or columns that are all NA, or just those containing any NAs.

By default, dropna drops any row containing a missing value:

In [None]:
data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan,  np.nan],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])

data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [None]:
# dropna by default drops any row containing a missing value, default is how='any'
cleaned = data.dropna()

cleaned

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


In [None]:
# Passing how='any' drops every row that contains at least one NA

data.dropna(how='any')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


In [None]:
# Passing how='all' will only drop rows that are all NA

data.dropna(how='all')


Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


In [None]:
# Dropping columns works the same way: pass axis=1.
# First, add a column that is entirely NaN so there is something to drop:

data[4] = np.nan
data

Unnamed: 0,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [None]:
# drop columns with all NaN

data.dropna(axis=1, how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [None]:
# A related way to filter out DataFrame rows tends to concern time series data.
# Suppose you want to keep only rows containing a certain number of observations.
# You can indicate this with the thresh argument:

df = pd.DataFrame(np.random.randn(7, 3))
df.loc[:4, 1] = np.nan
df.loc[:3, 2] = np.nan

df

Unnamed: 0,0,1,2
0,-0.882505,,
1,0.484819,,
2,0.995109,,
3,-1.729005,,
4,-0.921851,,1.742706
5,-0.875147,1.270882,1.73607
6,1.862085,0.646274,-1.428189


In [None]:
df.dropna(thresh=3)  # keep only rows with at least 3 non-NA values

Unnamed: 0,0,1,2
5,-0.875147,1.270882,1.73607
6,1.862085,0.646274,-1.428189
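
Because the cell above uses random numbers, its values change on every run. Here is a small deterministic sketch of `thresh`, on made-up data, applied first to rows and then to columns.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [1.0, 2.0, np.nan],
                      'b': [np.nan, 5.0, np.nan],
                      'c': [7.0, 8.0, 9.0]})

# Keep rows that have at least 2 non-NA values (row 2 has only one)
rows_kept = frame.dropna(thresh=2)

# Keep columns that have at least 2 non-NA values (column 'b' has only one)
cols_kept = frame.dropna(axis=1, thresh=2)

print(rows_kept)
print(cols_kept)
```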


# Filling in Missing Data

Rather than filtering out missing data (and potentially discarding other data along with it), you may want to fill in the “holes” in any number of ways.

For most purposes, the fillna method is the workhorse function to use.

Calling fillna with a constant replaces missing values with that value:

In [None]:
# fillna with zero
df.fillna(0)

Unnamed: 0,0,1,2
0,-0.882505,0.0,0.0
1,0.484819,0.0,0.0
2,0.995109,0.0,0.0
3,-1.729005,0.0,0.0
4,-0.921851,0.0,1.742706
5,-0.875147,1.270882,1.73607
6,1.862085,0.646274,-1.428189


In [None]:
# Calling fillna with a dict lets you use a different fill value for each column:

# fill NaN in column 1 with 0.5 and NaN in column 2 with -1
df.fillna({1: 0.5, 2: -1})  # Note: this does not change the original df; print df to check

Unnamed: 0,0,1,2
0,-0.882505,0.5,-1.0
1,0.484819,0.5,-1.0
2,0.995109,0.5,-1.0
3,-1.729005,0.5,-1.0
4,-0.921851,0.5,1.742706
5,-0.875147,1.270882,1.73607
6,1.862085,0.646274,-1.428189


In [None]:
# fillna returns a new object, but you can modify the existing object in place
# by using inplace=True

df.fillna(0, inplace=True)
df

Unnamed: 0,0,1,2
0,-0.882505,0.0,0.0
1,0.484819,0.0,0.0
2,0.995109,0.0,0.0
3,-1.729005,0.0,0.0
4,-0.921851,0.0,1.742706
5,-0.875147,1.270882,1.73607
6,1.862085,0.646274,-1.428189


In [None]:
# The same interpolation methods available for reindexing can be used with fillna

df = pd.DataFrame(np.random.randn(6, 3))
df.loc[2:, 1] = np.nan
df.loc[4:, 2] = np.nan

df


Unnamed: 0,0,1,2
0,-0.406278,0.560958,-1.144105
1,0.732506,-1.497506,0.483214
2,-0.455844,,0.68604
3,-0.574428,,-1.442555
4,1.186456,,
5,-0.185804,,


In [None]:
# ffill - fill NA/NaN values by propagating the last valid observation forward
# (fillna(method='ffill') is deprecated in recent pandas versions; use ffill() instead)
df.ffill()

Unnamed: 0,0,1,2
0,-0.406278,0.560958,-1.144105
1,0.732506,-1.497506,0.483214
2,-0.455844,-1.497506,0.68604
3,-0.574428,-1.497506,-1.442555
4,1.186456,-1.497506,-1.442555
5,-0.185804,-1.497506,-1.442555


In [None]:
# Forward-fill at most two consecutive NA/NaN values
df.ffill(limit=2)

Unnamed: 0,0,1,2
0,-0.406278,0.560958,-1.144105
1,0.732506,-1.497506,0.483214
2,-0.455844,-1.497506,0.68604
3,-0.574428,-1.497506,-1.442555
4,1.186456,,-1.442555
5,-0.185804,,-1.442555
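
The cells above also depend on random data, so here is a small deterministic sketch of forward fill, backward fill (`bfill`, the counterpart mentioned earlier), and the `limit` argument. The Series is made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

forward = s.ffill()           # carry the last valid value forward
backward = s.bfill()          # pull the next valid value backward
limited = s.ffill(limit=1)    # fill at most one consecutive NaN per gap

print(pd.DataFrame({'original': s, 'ffill': forward,
                    'bfill': backward, 'ffill_limit_1': limited}))
```

With `limit=1`, the second NaN in the gap stays missing because only one value may be filled after each valid observation.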


In [None]:
# With fillna you can do lots of other things with a little creativity. For example, you
# might pass the mean or median value of a Series

data = pd.Series([1., np.nan, 3.5, np.nan, 7])
data.fillna(data.mean())


0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
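
The same idea extends to a DataFrame: passing a Series of per-column statistics to fillna fills each column with its own value. This is a minimal sketch; the column names `age` and `fare` are hypothetical.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'age': [22.0, np.nan, 30.0, 40.0],
                      'fare': [7.0, 8.0, np.nan, 100.0]})

# frame.mean() and frame.median() return one value per column;
# fillna matches those values to the columns by label
filled_mean = frame.fillna(frame.mean())
filled_median = frame.fillna(frame.median())

print(filled_mean)
print(filled_median)
```

The median is often the safer choice when a column has outliers, as in the `fare` column here, where one large value pulls the mean well above the typical value.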

# Exercise
1. Load the dataset and calculate the mean, median, mode, and standard deviation of a specific column.
Dataset: Iris Dataset (https://archive.ics.uci.edu/ml/datasets/iris)
2. Filter the dataset to show only rows where a specific column's value is above a certain threshold.
Dataset: Wine Quality Dataset (https://archive.ics.uci.edu/ml/datasets/wine+quality)
3. Identify and handle missing values in a dataset by filling them with the mean or median of the respective column.
Dataset: Titanic Dataset (https://www.kaggle.com/c/titanic/data)
4. Count the number of unique values in a specific column of the dataset.
Dataset: Bank Marketing Dataset (https://archive.ics.uci.edu/ml/datasets/bank+marketing)
5. Compute the minimum, maximum, and range of a numeric column in the dataset.
Dataset: Boston Housing Dataset (https://www.kaggle.com/c/boston-housing)
6. Find the count of unique values in each column of the dataset.
Dataset: Breast Cancer Wisconsin (Diagnostic) Dataset (https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic))

In [None]:
# Your solution here

### Submission: File > Print > As PDF > Submit in ODL Platform
### Make sure the answer is visible in PDF format.
### Deadline: 1 week after today's class.