### DS102 | Self-Study Week 1A - Pandas & Numpy II
<hr>
## Learning Objectives
At the end of this self-study, you will be able to:

- Construct a `DataFrame` from a `list` of `dict`

- Read a CSV file into a `DataFrame` with a specified `sep` parameter

- Use `DataFrame.fillna()` to substitute any `NaN` values

- Use `DataFrame.dropna()` to remove any rows with `NaN` values

- Use `DataFrame.count()` to find out the number of records

- Retrieve one or a range of records from a `DataFrame`

- Retrieve one or a range of records from a `Series`

- Retrieve multiple records from a `DataFrame` using a list of indices

- Update a value in a `Series`

- Drop columns from a `DataFrame` / Select columns from a `DataFrame`

- Sort values in a `DataFrame` based on a column

### Datasets Required for this Self-Study
1. `wines-3k.csv`

#### Import `pandas`, `numpy` 

In [3]:
import pandas as pd
import numpy as np

### Create a `DataFrame` from a `list` of `dict`
Besides being read from a CSV file, a `DataFrame` can also be instantiated in other ways. Consider the following `list` of `dict`.

In [4]:
video_ads = [
    {"title": "Healthy Living", "views": 15934},
    {"title": "Get a ride, anytime anywhere", "views": 923834},
    {"title": "Send money to your friends with GrabPay", "views": 23466},
    {"title": "Ubereats now delivers nationwide", "views": 1337},
    {"title": "Facebook is hiring data scientists!", "views": 15934},
]

Convert the `list` to a `DataFrame` using `pd.DataFrame(video_ads)`.

In [5]:
videos_df = pd.DataFrame(video_ads)

Pass a number to the `df.head()` function, e.g. `head(3)`, to show only the first 3 records in the `df`.

In [4]:
# Show only the first 3 records in videos_df
videos_df.head(3)

Unnamed: 0,title,views
0,Healthy Living,15934
1,"Get a ride, anytime anywhere",923834
2,Send money to your friends with GrabPay,23466
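
The sketch below illustrates how the keys of each `dict` become column names. The records are hypothetical and not part of the dataset; the second one deliberately lacks a `views` key to show what pandas does with a missing key.

```python
import pandas as pd

# Hypothetical records: the second dict has no "views" key
ads = [
    {"title": "Ad A", "views": 100},
    {"title": "Ad B"},
    {"title": "Ad C", "views": 300},
]

ads_df = pd.DataFrame(ads)

# Column names come from the dict keys
print(list(ads_df.columns))

# The missing "views" entry is stored as NaN, so the column becomes float
print(ads_df["views"].isnull().sum())
print(ads_df["views"].dtype)
```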


### Create a `DataFrame` from a `list` of `list`, using `columns`
A `DataFrame` can also be instantiated from a `list` of `list`s. Consider the following list, in which each inner list holds a channel name and the number of videos in that channel.

In [6]:
channels_and_videos = [['Apple', 2], ['Facebook', 1], ['Hackwagon', 5], ['Grab', 3], 
                       ['Sony', 1], ['Subway', 1], ['Netflix', 1], ['Uber', 6], ['Health Promotion Board', 2]]

Convert this into a `DataFrame` by passing the list together with the `columns` parameter. Note that `columns` takes in a `list` of header names, one for each column in the `df`.

In [7]:
channels_df = pd.DataFrame(channels_and_videos, columns=['channels', 'number_of_videos'])

In [8]:
# Show the first 5 records in a Series
channels_df['channels'].head()

0        Apple
1     Facebook
2    Hackwagon
3         Grab
4         Sony
Name: channels, dtype: object

In [9]:
# Sample (without replacement) 2 records from the Series
channels_df['channels'].sample(2)

0     Apple
5    Subway
Name: channels, dtype: object

### Sorting using `sort_values`
Use `DataFrame.sort_values()` to sort the `df` based on a selected column.

In [10]:
channels_df.sort_values('channels')

Unnamed: 0,channels,number_of_videos
0,Apple,2
1,Facebook,1
3,Grab,3
2,Hackwagon,5
8,Health Promotion Board,2
6,Netflix,1
4,Sony,1
5,Subway,1
7,Uber,6


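To sort in descending order, specify the parameter `ascending=False`. Sorting keeps each row's original index label, so `reset_index()` is usually called afterwards to renumber the rows from `0`. Passing `drop=True` discards the old index instead of keeping it as a column.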

In [14]:
# To store the sorted df, assign the df to a new variable.
channels_df_desc = channels_df.sort_values('number_of_videos', ascending=False)
channels_df_desc.reset_index(drop=True, inplace=True)
channels_df_desc

Unnamed: 0,channels,number_of_videos
0,Uber,6
1,Hackwagon,5
2,Grab,3
3,Apple,2
4,Health Promotion Board,2
5,Facebook,1
6,Sony,1
7,Subway,1
8,Netflix,1
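
The following sketch, using a shortened version of the channel list, shows why `reset_index()` tends to follow a sort: the original index labels travel with the rows until they are renumbered.

```python
import pandas as pd

channels = pd.DataFrame(
    [["Apple", 2], ["Facebook", 1], ["Hackwagon", 5], ["Grab", 3]],
    columns=["channels", "number_of_videos"],
)

# sort_values returns a new df; the original index labels stay with each row
sorted_df = channels.sort_values("number_of_videos", ascending=False)
print(list(sorted_df.index))

# Without inplace=True, reset_index also returns a new df
renumbered = sorted_df.reset_index(drop=True)
print(list(renumbered.index))
print(list(renumbered["channels"]))
```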


### Filtering, `isin()`, OR and NOT conditions

Use the following notation to filter a `df` for records **NOT** satisfying a condition. Wrap the condition in round parentheses `()` and put a tilde `~` before it.
```python
df[~(df['column'] <conditional operator> <value>)]
```

In [11]:
# Filter for all videos NOT having views greater than 20000 (i.e. 20000 views or fewer)
videos_df[~(videos_df['views'] > 20000)]

Unnamed: 0,title,views
0,Healthy Living,15934
3,Ubereats now delivers nationwide,1337
4,Facebook is hiring data scientists!,15934
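
As a quick check of what the negation means, the sketch below compares `~(... > 20000)` with `<= 20000` on hypothetical data that includes a value of exactly 20000. The two filters agree here because no value is missing. A `NaN` would pass the negated filter but fail the `<=` filter.

```python
import pandas as pd

# Hypothetical view counts, including one exactly at the threshold
views = pd.DataFrame(
    {"title": ["a", "b", "c", "d"], "views": [15934, 923834, 20000, 1337]}
)

not_over = views[~(views["views"] > 20000)]
at_most = views[views["views"] <= 20000]

# Both filters keep the same rows, including the one with exactly 20000 views
print(not_over.equals(at_most))
print(list(not_over["title"]))
```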


To filter for records that satisfy `CONDITION_1` **OR** `CONDITION_2`, use a pipe `|` between the conditions. Contrast this with using `&` between conditions, which keeps only the records that satisfy both.

In [12]:
# Filter for channels that have 1 video OR at least 5 videos.
channels_df[(channels_df['number_of_videos'] == 1) | (channels_df['number_of_videos'] >= 5)]

Unnamed: 0,channels,number_of_videos
1,Facebook,1
2,Hackwagon,5
4,Sony,1
5,Subway,1
6,Netflix,1
7,Uber,6
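
To make the contrast between `|` and `&` concrete, here is a small sketch on a shortened, illustrative version of the channel data. Each condition sits in its own parentheses because `|` and `&` bind more tightly than comparison operators.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "channels": ["Apple", "Facebook", "Hackwagon", "Uber"],
        "number_of_videos": [2, 1, 5, 6],
    }
)

# OR: exactly 1 video, or at least 5 videos
either = df[(df["number_of_videos"] == 1) | (df["number_of_videos"] >= 5)]

# AND: at least 2 videos and at most 5 videos
both = df[(df["number_of_videos"] >= 2) & (df["number_of_videos"] <= 5)]

print(list(either["channels"]))
print(list(both["channels"]))
```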


`.isin()` can also be used to filter for values that are `int`s or `float`s.

In [13]:
# Filter for channels that have 2, 4, 5 or 6 videos.
req_videos_list = [2, 4, 5, 6]
channels_df[channels_df['number_of_videos'].isin(req_videos_list)]

Unnamed: 0,channels,number_of_videos
0,Apple,2
2,Hackwagon,5
7,Uber,6
8,Health Promotion Board,2
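
`.isin()` combines with `~` in the same way as any other condition. A minimal sketch with illustrative channel names (`"Lyft"` is a made-up entry that is deliberately absent from the data):

```python
import pandas as pd

df = pd.DataFrame(
    {"channels": ["Apple", "Grab", "Sony", "Uber"], "number_of_videos": [2, 3, 1, 6]}
)

# Values in the list that do not occur in the column simply match nothing
wanted = ["Grab", "Uber", "Lyft"]

in_list = df[df["channels"].isin(wanted)]
not_in_list = df[~df["channels"].isin(wanted)]

print(list(in_list["channels"]))
print(list(not_in_list["channels"]))
```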


#### Use `DataFrame.fillna()` to fill in `NaN` values (missing data)
It is common for a dataset to have missing data. Use `df.fillna()` to specify what value to fill in for the missing data. Missing values affect aggregates: pandas skips `NaN` by default when computing the mean, minimum or maximum, while NumPy functions such as `np.mean()` return `nan` for an array that contains `NaN`. In this case, fill all missing values for `price` with `0`. Note that filling with `0` changes aggregates such as the mean.

Read the [documentation](https://pandas.pydata.org/pandas-docs/stable/missing_data.html) to find out more. A similar implementation is available for `Series`. Read more [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.fillna.html#pandas.Series.fillna).

In [30]:
# Read from the wines-3k CSV file
wines_df = pd.read_csv('wines-3k.csv', sep='|')
wines_df.head()

Unnamed: 0,country,points,price
0,US,84,12.0
1,Italy,87,15.0
2,US,86,48.0
3,US,89,150.0
4,Italy,89,59.0
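
The `sep` parameter tells `read_csv()` which character separates the fields. The sketch below uses a short, hypothetical pipe-delimited string in place of the actual file so that it runs without `wines-3k.csv`.

```python
import io

import pandas as pd

# Hypothetical pipe-delimited text standing in for a file on disk
raw = "country|points|price\nUS|84|12.0\nFrance|93|\n"

# With sep="|" the three fields are split into three columns
demo_df = pd.read_csv(io.StringIO(raw), sep="|")
print(demo_df)

# With the default sep="," every line is read as one single field
wrong_df = pd.read_csv(io.StringIO(raw))
print(wrong_df.shape)
```

The empty `price` field for France is read as `NaN`, which is how missing data typically enters a `DataFrame`.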


In [36]:
###
# fillna() / fill for all missing prices
###
wines_df_fill = wines_df.copy()
print(wines_df_fill.head(20))
# We are directly filling in the df, hence, inplace=True must be specified
wines_df_fill.fillna(0, inplace=True)
print()
print(wines_df_fill.head(20))
# Observe that for index 14, the value for 'price' has been replaced with 0.0

      country  points  price
0          US      84   12.0
1       Italy      87   15.0
2          US      86   48.0
3          US      89  150.0
4       Italy      89   59.0
5          US      87   25.0
6       Italy      87   26.0
7       Spain      85   20.0
8          US      89   20.0
9          US      92   25.0
10      Italy      88   49.0
11   Portugal      83   16.0
12     France      88   85.0
13      Italy      87   12.0
14     France      93    NaN
15         US      85   18.0
16         US      88   46.0
17     France      89   20.0
18  Argentina      81    6.0
19  Argentina      85   13.0

      country  points  price
0          US      84   12.0
1       Italy      87   15.0
2          US      86   48.0
3          US      89  150.0
4       Italy      89   59.0
5          US      87   25.0
6       Italy      87   26.0
7       Spain      85   20.0
8          US      89   20.0
9          US      92   25.0
10      Italy      88   49.0
11   Portugal      83   16.0
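12     France      88   85.0
13      Italy      87   12.0
14     France      93    0.0
15         US      85   18.0
16         US      88   46.0
17     France      89   20.0
18  Argentina      81    6.0
19  Argentina      85   13.0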
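
Filling with `0` is convenient but changes aggregates. The sketch below, on a three-row illustrative frame, passes a `dict` to `fillna()` so that only the `price` column is filled, then compares the mean before and after.

```python
import numpy as np
import pandas as pd

wines = pd.DataFrame(
    {
        "country": ["US", "France", "Italy"],
        "points": [84, 93, 87],
        "price": [12.0, np.nan, 15.0],
    }
)

# pandas skips NaN by default: (12 + 15) / 2
print(wines["price"].mean())

# A dict maps column names to fill values; without inplace=True a new df is returned
filled = wines.fillna({"price": 0})

# The filled 0 now counts as a real price: (12 + 0 + 15) / 3
print(filled["price"].mean())
```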

#### Use `DataFrame.dropna()` to remove rows with `NaN` values (missing data)
Alternatively, use `dropna(axis=0)` to remove records with missing data. `axis=0` (the default) removes rows; `axis=1` would remove columns instead.

Read the [documentation](https://pandas.pydata.org/pandas-docs/stable/missing_data.html) to find out more. A similar implementation is available for `Series`. Read more [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.dropna.html#pandas.Series.dropna).

In [16]:
###
# dropna() / remove all missing prices
###
wines_df_drop = wines_df.copy()
print(wines_df_drop.head(20))
wines_df_drop.dropna(axis=0, inplace=True)
print()
print(wines_df_drop.head(20))
# Observe that the record at location index 14 is removed

      country  points  price
0          US      84   12.0
1       Italy      87   15.0
2          US      86   48.0
3          US      89  150.0
4       Italy      89   59.0
5          US      87   25.0
6       Italy      87   26.0
7       Spain      85   20.0
8          US      89   20.0
9          US      92   25.0
10      Italy      88   49.0
11   Portugal      83   16.0
12     France      88   85.0
13      Italy      87   12.0
14     France      93    NaN
15         US      85   18.0
16         US      88   46.0
17     France      89   20.0
18  Argentina      81    6.0
19  Argentina      85   13.0

      country  points  price
0          US      84   12.0
1       Italy      87   15.0
2          US      86   48.0
3          US      89  150.0
4       Italy      89   59.0
5          US      87   25.0
6       Italy      87   26.0
7       Spain      85   20.0
8          US      89   20.0
9          US      92   25.0
10      Italy      88   49.0
11   Portugal      83   16.0
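12     France      88   85.0
13      Italy      87   12.0
15         US      85   18.0
16         US      88   46.0
17     France      89   20.0
18  Argentina      81    6.0
19  Argentina      85   13.0
...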
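
By default `dropna()` removes a row if any of its columns is missing. A minimal sketch, on illustrative data, of restricting the check to selected columns with the `subset` parameter:

```python
import numpy as np
import pandas as pd

wines = pd.DataFrame(
    {
        "country": ["US", None, "Italy", "Spain"],
        "points": [84, 93, 87, 85],
        "price": [12.0, 20.0, np.nan, 20.0],
    }
)

# Drops every row with a missing value in any column
any_missing = wines.dropna()

# Drops only the rows where 'price' is missing
price_only = wines.dropna(subset=["price"])

print(list(any_missing.index))
print(list(price_only.index))
```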

#### Use `DataFrame.isnull()` to find rows with `NaN` values (missing data)
Alternatively, use `isnull()` to find all records with missing data.

Read the [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isnull.html?highlight=isnull#pandas.DataFrame.isnull) to find out more. A similar implementation is available for `Series`. Read more [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.isnull.html?highlight=series%20isnull#pandas.Series.isnull).

In [34]:
###
# isnull() to find all missing prices
###
wines_df[wines_df['price'].isnull()].iloc[:5]  # Remove .iloc[:5] to see all records


Unnamed: 0,country,points,price
14,France,93,
27,Austria,94,
32,Italy,92,
36,Italy,87,
52,Australia,85,
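
`isnull()` returns a boolean mask, so it can be summed to count missing values or negated to keep the complete rows. A short sketch on illustrative data:

```python
import numpy as np
import pandas as pd

wines = pd.DataFrame(
    {
        "country": ["US", "France", "Italy", "Austria"],
        "points": [84, 93, 87, 94],
        "price": [12.0, np.nan, 15.0, np.nan],
    }
)

# True counts as 1, so summing the mask gives the missing count per column
missing_per_column = wines.isnull().sum()
print(missing_per_column)

# The same mask selects the incomplete rows; ~mask selects the complete ones
mask = wines["price"].isnull()
print(list(wines[mask].index))
print(list(wines[~mask].index))
```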


#### Use `DataFrame.count()` to find the number of non-null values per column

In [37]:
print(wines_df.count())
print()
print(wines_df_drop.count())
# Observe that all the records with missing data have been dropped.

country    3000
points     3000
price      2718
dtype: int64



country    2718
points     2718
price      2718
dtype: int64

Alternatively, use `len()` on a `Series` to get the number of records. Unlike `count()`, `len()` also counts `NaN` entries.

In [19]:
len(wines_df_drop['country'])

2718
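
The difference between `count()` and `len()` only shows up when data is missing. A minimal sketch on an illustrative frame with one missing price:

```python
import numpy as np
import pandas as pd

wines = pd.DataFrame(
    {
        "country": ["US", "France", "Italy"],
        "points": [84, 93, 87],
        "price": [12.0, np.nan, 15.0],
    }
)

# len() counts every record, including the one with a missing price
print(len(wines["price"]))

# count() counts only the non-null values
print(wines["price"].count())

# shape gives (number of rows, number of columns)
print(wines.shape)
```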

### Retrieve records from `DataFrame` or `Series`

Use slice notation, as with a `list`, to retrieve a range of rows from the `DataFrame`. As with lists, indices start from `0` and the end of the slice is excluded. The left-most column in the output is the index of the `df`.

In [42]:
wines_df_fill[10:20]

Unnamed: 0,country,points,price
10,Italy,88,49.0
11,Portugal,83,16.0
12,France,88,85.0
13,Italy,87,12.0
14,France,93,0.0
15,US,85,18.0
16,US,88,46.0
17,France,89,20.0
18,Argentina,81,6.0
19,Argentina,85,13.0
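
Slicing a `DataFrame` with plain square brackets works by position and excludes the end, whereas slicing with `.loc[]` works by label and includes the end. A small sketch on illustrative data to show the difference:

```python
import pandas as pd

df = pd.DataFrame({"points": [84, 87, 86, 89, 89]})

# Plain slicing is positional and excludes the end, like a list
plain = df[1:3]

# .iloc[] slices by position in the same way
by_position = df.iloc[1:3]

# .loc[] slices by label and includes the end label
by_label = df.loc[1:3]

print(list(plain.index))
print(list(by_label.index))
```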


Indices can also be used to slice from a `Series`.

In [21]:
wine_points = wines_df_fill['points']
print(type(wine_points))
wine_800_series = wine_points[850:860:2] #Similarly, the 3rd number in the colon (:) notation represents the step
wine_800_series

<class 'pandas.core.series.Series'>


850    90
852    85
854    88
856    90
858    92
Name: points, dtype: int64

To retrieve multiple specific rows, store the indices in a `list` and use `.loc[]` to retrieve them. 

**Note:** After `.loc`, use square brackets `[]`, not round brackets.

In [22]:
wines_df_fill.loc[[11, 15, 16]]

Unnamed: 0,country,points,price
11,Portugal,83,16.0
15,US,85,18.0
16,US,88,46.0
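
`.loc[]` looks rows up by index label, not by position. The two coincide on a freshly loaded `df`, but not after sorting. A minimal sketch on illustrative data, with `.iloc[]` shown for positional lookup:

```python
import pandas as pd

df = pd.DataFrame(
    {"country": ["US", "Italy", "France", "Spain"], "points": [84, 87, 93, 85]}
)

# After sorting, the index labels are in the order [2, 1, 3, 0]
by_points = df.sort_values("points", ascending=False)

# .loc[] fetches the rows labelled 0 and 1
by_label = by_points.loc[[0, 1]]

# .iloc[] fetches the first two rows as currently ordered
by_position = by_points.iloc[[0, 1]]

print(list(by_label["country"]))
print(list(by_position["country"]))
```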


Similar to `list`s, to update a value in a `Series`, specify the index and then specify the new value.

In [23]:
print(wine_800_series)
print()
wine_800_series.loc[850] = 92 #Observe that the value at index no. 850 has been updated.
print(wine_800_series)

850    90
852    85
854    88
856    90
858    92
Name: points, dtype: int64

850    92
852    85
854    88
856    90
858    92
Name: points, dtype: int64
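
When a `Series` is taken from a larger `Series` or `DataFrame`, whether an update also reaches the original can depend on the pandas version. A cautious sketch, assuming the original should stay unchanged, is to update an explicit copy (the values here are illustrative):

```python
import pandas as pd

points = pd.Series([90, 85, 88], index=[850, 852, 854], name="points")

# .copy() makes the subset independent of the original Series
subset = points.loc[[850, 852]].copy()
subset.loc[850] = 92

# Only the copy is updated
print(subset.loc[850])
print(points.loc[850])
```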


To only perform analysis on a **specific** set of columns, store all the columns to keep in a list and call it using `df[[col1, col2, ...]]`.

**Note**: Store this new `df` as a **NEW** variable. Also, use `df.copy()` before slicing the `df` by column. Column names are **case sensitive**.

In [24]:
wines_points_df = wines_df_fill.copy() #Make a copy of the original df
wines_points_df = wines_points_df[['country', 'points']] #Select specific columns.
wines_points_df.head()

Unnamed: 0,country,points
0,US,84
1,Italy,87
2,US,86
3,US,89
4,Italy,89
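
Selecting the columns to keep is one option; the other is to name the columns to remove with `DataFrame.drop(columns=...)`. A minimal sketch on illustrative data showing that both routes give the same result:

```python
import pandas as pd

wines = pd.DataFrame(
    {"country": ["US", "Italy"], "points": [84, 87], "price": [12.0, 15.0]}
)

# Keep only the listed columns
selected = wines[["country", "points"]]

# Or remove the unwanted column; drop() returns a new df by default
dropped = wines.drop(columns=["price"])

print(selected.equals(dropped))
print(list(wines.columns))
```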


### Exploration: Sorting by values in a `df`

In [49]:
# Exploration: sort the wines by prices in ascending order, smallest value on top
wines_points_price_sort = wines_df.copy()
wines_points_price_sort = wines_points_price_sort.sort_values('price')
wines_points_price_sort.dropna()

Unnamed: 0,country,points,price
1819,Germany,86,5.0
2175,Argentina,84,5.0
2166,US,83,5.0
77,Spain,81,6.0
2781,US,84,6.0
...,...,...,...
1876,France,97,288.0
1505,US,94,300.0
2053,France,96,360.0
1902,France,99,450.0
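
When a column contains `NaN`, `sort_values()` places those records last by default, in both ascending and descending order. The sketch below, on illustrative data, shows this and the `na_position` parameter that changes it.

```python
import numpy as np
import pandas as pd

wines = pd.DataFrame(
    {"country": ["US", "France", "Italy"], "price": [12.0, np.nan, 5.0]}
)

# NaN goes last by default, whether ascending or descending
ascending = wines.sort_values("price")
descending = wines.sort_values("price", ascending=False)

# na_position="first" moves the missing records to the top
nan_first = wines.sort_values("price", na_position="first")

print(list(ascending.index))
print(list(descending.index))
print(list(nan_first.index))
```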


In [50]:
# Exploration: sort the wines by points in descending order, largest value on top
wines_points = wines_df.copy()
wines_points.sort_values("points", ascending=False)

Unnamed: 0,country,points,price
522,US,100,200.0
919,US,100,245.0
1902,France,99,450.0
972,US,98,147.0
114,Germany,97,250.0
...,...,...,...
711,Uruguay,80,15.0
1940,US,80,17.0
1651,Spain,80,11.0
1720,US,80,20.0


**Credits**
- [Wine Reviews, Kaggle](https://www.kaggle.com/zynicide/wine-reviews) for the dataset