# Week-09: Tutorial on Pandas

<font size='4'>

This week, we start to revisit `pandas` in detail.

Pandas is used to
- Import datasets from databases, spreadsheets, comma-separated values (CSV) files, etc.
- Clean datasets, e.g., by handling missing values.
- Tidy datasets by reshaping the structure into a suitable format prior to analysis.
- Aggregate data by calculating summary statistics.
- Visualize datasets and uncover hidden patterns.

## 0. Import packages

<font size='4'>

You should be pretty familiar with importing packages.

In [2]:
# 0.1
import os
import glob
import numpy as np
import pandas as pd

## 1. Import datasets/files to Pandas

### 1.1. Import comma-separated values (CSV) file

<font size='4'>
    
- Use `pd.read_csv()` with the path to the CSV file.
- The resulting object is a pandas `DataFrame`, which we name `feature_df`.
- https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

In [3]:
# 1.1.1
print(os.getcwd())

/Users/tma33/Library/CloudStorage/OneDrive-EmoryUniversity/Emory/Rollins SPH/2025/BIOS-584/python_proj


In [4]:
# 1.1.2
feature_dir = '{}/data/features.csv'.format(os.getcwd())
feature_df = pd.read_csv(feature_dir)
print(type(feature_df))
print(feature_df)

<class 'pandas.core.frame.DataFrame'>
      mpg  cylinders  displacement horsepower  weight  acceleration vehicle id
0    18.0          8           307        130    3504          12.0  C-1689780
1    15.0          8           350        165    3693          11.5  B-1689791
2    18.0          8           318        150    3436          11.0  P-1689802
3    16.0          8           304        150    3433          12.0  A-1689813
4    17.0          8           302        140    3449          10.5  F-1689824
..    ...        ...           ...        ...     ...           ...        ...
393  27.0          4           140         86    2790          15.6  F-1694103
394  44.0          4            97         52    2130          24.6  V-1694114
395  32.0          4           135         84    2295          11.6  D-1694125
396  28.0          4           120         79    2625          18.6  F-1694136
397  31.0          4           119         82    2720          19.4  C-1694147

[398 rows x 7 columns]

### 1.2. Import text files

<font size='4'>

- Reading a delimited text file works the same way as reading a CSV file.
- Use the `pd.read_csv()` function.
- The only difference is that you need to specify the separator with the `sep` parameter (argument).
- The separator is the symbol that separates the values (columns) within each row of the file.
- Common separators include
    - Comma (`sep=','`),
    - Single whitespace (`sep=r'\s'`),
    - Multiple whitespace (`sep=r'\s+'`),
    - Tab (`sep='\t'`),
    - Colon (`sep=':'`)
- Write regular-expression separators such as `\s` as raw strings (`r'\s'`); otherwise Python may warn about an invalid escape sequence.

In [5]:
# 1.2.1
arr_space_dir = '{}/data/my_arr2_delimiter_space.txt'.format(os.getcwd())
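# Raw string avoids the invalid-escape warning; a regex separator is parsed by the Python engine.
arr_space_df = pd.read_csv(arr_space_dir, sep=r'\s', engine='python')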
print(arr_space_df)

   Value1  Value2  Value3
0  0.4839  0.4536  0.3561
1  0.1292  0.6875 -9.0000
2  0.1781  0.3049  0.8928
3 -9.0000  0.5801  0.2038
4  0.5993  0.4357  0.7410




In [6]:
# 1.2.2
arr_space_dir = '{}/data/my_arr2_delimiter_space.txt'.format(os.getcwd())
arr_space_df = pd.read_csv(arr_space_dir, sep=r'\s+')  # raw string for the regex separator
print(arr_space_df)

   Value1  Value2  Value3
0  0.4839  0.4536  0.3561
1  0.1292  0.6875 -9.0000
2  0.1781  0.3049  0.8928
3 -9.0000  0.5801  0.2038
4  0.5993  0.4357  0.7410


In [7]:
# 1.2.3
arr_comma_dir = '{}/data/my_arr2_delimiter_comma.txt'.format(os.getcwd())
arr_comma_df = pd.read_csv(arr_comma_dir, sep=',')
print(arr_comma_df)

   Value1  Value2  Value3
0  0.4839  0.4536  0.3561
1  0.1292  0.6875 -9.0000
2  0.1781  0.3049  0.8928
3 -9.0000  0.5801  0.2038
4  0.5993  0.4357  0.7410
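
The separator options above can be tried without any files on disk. The following is a minimal sketch that parses small in-memory strings with three different `sep` values; the sample text and the variable names are made up for this illustration and are not part of the course data.

```python
import io
import pandas as pd

# Hypothetical file contents, held in memory instead of on disk.
space_text = "Value1 Value2   Value3\n0.48 0.45  0.35\n0.12   0.68 -9.0\n"
tab_text = "Value1\tValue2\tValue3\n0.48\t0.45\t0.35\n0.12\t0.68\t-9.0\n"
colon_text = "Value1:Value2:Value3\n0.48:0.45:0.35\n0.12:0.68:-9.0\n"

# r'\s+' treats any run of whitespace as a single separator.
space_df = pd.read_csv(io.StringIO(space_text), sep=r'\s+')
tab_df = pd.read_csv(io.StringIO(tab_text), sep='\t')
colon_df = pd.read_csv(io.StringIO(colon_text), sep=':')

# All three parse into the same 2 x 3 table.
print(space_df.shape, tab_df.shape, colon_df.shape)
```

`io.StringIO` wraps a string so that `pd.read_csv()` can read it like a file, which is convenient for quick experiments.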


### 1.3. Import Excel files (single sheet)

<font size='4'>

- For Excel files (`.xls` and `.xlsx`), use the `pd.read_excel()` function and pass in the file path.
- You can specify the `header`. Its default value is `0`, which treats the first row as the header (column names).
- You can also supply column names as a list through the `names` argument.
- The `index_col` argument (default `None`) can be used if the file contains a row index.
    - In a pandas DataFrame or Series, the index is an identifier that points to the location of a row or column.
    - You can access a specific row or column by using its index.
    - We will cover this in more detail later.

In [8]:
# 1.3.1
ptsd_dir = '{}/data/PTSD dataset.xlsx'.format(os.getcwd())
ptsd_df = pd.read_excel(ptsd_dir)
print(ptsd_df)

Empty DataFrame
Columns: []
Index: []
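
The `header`, `names`, and `index_col` arguments described above are also accepted by `pd.read_csv()`. The sketch below uses `pd.read_csv()` on an in-memory string so that it runs without an Excel file; this substitution is an assumption made for illustration, and the sample data and column names are hypothetical.

```python
import io
import pandas as pd

# Hypothetical table with a header row and an identifier column.
raw_text = "id,score,group\na1,0.5,ctl\na2,0.7,trt\n"

# Default header=0: the first row supplies the column names.
default_df = pd.read_csv(io.StringIO(raw_text))

# index_col='id' uses the 'id' column as the row labels.
indexed_df = pd.read_csv(io.StringIO(raw_text), index_col='id')

# header=0 together with names replaces the column names found in the file.
renamed_df = pd.read_csv(io.StringIO(raw_text), header=0,
                         names=['subject', 'value', 'arm'])

print(default_df.columns.tolist())
print(indexed_df.index.tolist())
print(renamed_df.columns.tolist())
```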


### 1.4. Import Excel files (multiple sheets)

<font size='4'>

- As you may have noticed, the `ptsd_df` above is an empty DataFrame. That is because I manually created an empty sheet as the first tab.
- To read a particular tab of the Excel file, specify the `sheet_name` argument. You can pass either the tab name (as a string) or an integer for the sheet position.
- Note that Python uses `0`-based indexing, so `sheet_name=1` refers to the second tab.

In [9]:
# 1.4.1
ptsd_dir = '{}/data/PTSD dataset.xlsx'.format(os.getcwd())
ptsd_df = pd.read_excel(ptsd_dir, sheet_name='main_dataset')
# print(ptsd_df)

In [10]:
# 1.4.2
print('Or')
ptsd_df = pd.read_excel(ptsd_dir, sheet_name=1) # remember that python's index starts from 0.
# print(ptsd_df)

Or


### 1.5. Import JSON file

<font size='4'>

- Similar to CSV files, use the `pd.read_json()` function for JSON files.
- A handy trick: use the wildcard `*` with the `glob.glob()` function to locate a file without typing its full name.

In [11]:
# 1.5.1
swlda_dir = '{}/data/swLDA*'.format(os.getcwd())
print(swlda_dir)
print(glob.glob(swlda_dir))
swlda_df = pd.read_json(glob.glob(swlda_dir)[0])
print(swlda_df.keys())
print()
print(swlda_df['NewOnly'])
print()
# print(swlda_df['NewOnly']['letter'])
# print(swlda_df['NewOnly']['prob'])

/Users/tma33/Library/CloudStorage/OneDrive-EmoryUniversity/Emory/Rollins SPH/2025/BIOS-584/python_proj/data/swLDA*
['/Users/tma33/Library/CloudStorage/OneDrive-EmoryUniversity/Emory/Rollins SPH/2025/BIOS-584/python_proj/data/swLDA_predict_BCI_001_TRN_test_seq_size_1_threshold_0.5.json']
Index(['NewOnly', 'Strict'], dtype='object')

letter    [[Z, T, T, T, T, T, T], [H, H, H, H, H, H, H],...
prob      [[[0.000766330793406, 0.07476473805305, 0.0001...
Name: NewOnly, dtype: object



In [12]:
PTSD_data_dictionary_dir = '{}/data/*dictionary*'.format(os.getcwd())
print(glob.glob(PTSD_data_dictionary_dir))

['/Users/tma33/Library/CloudStorage/OneDrive-EmoryUniversity/Emory/Rollins SPH/2025/BIOS-584/python_proj/data/PTSD_data_dictionary.xlsx']
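
The wildcard matching used above can be explored safely in a temporary folder. This is a minimal sketch; the file names are hypothetical and are created empty just to have something to match.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a few empty files with made-up names.
    for file_name in ['swLDA_run_001.json', 'notes.txt', 'data_dictionary.xlsx']:
        open(os.path.join(tmp_dir, file_name), 'w').close()

    # '*' matches any run of characters within a file name.
    json_matches = glob.glob(os.path.join(tmp_dir, 'swLDA*'))
    dict_matches = glob.glob(os.path.join(tmp_dir, '*dictionary*'))
    no_matches = glob.glob(os.path.join(tmp_dir, 'missing*'))

    matched_names = [os.path.basename(path) for path in json_matches]

print(matched_names)
print(len(dict_matches), no_matches)
```

`glob.glob()` always returns a list, so indexing with `[0]` fails with an `IndexError` when nothing matches. Checking the length of the list first is a cheap safeguard.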


## 2. Outputting data in pandas

### 2.1. Outputting a DataFrame into a CSV file

<font size='4'>

- Suppose we have created a DataFrame `test_df` and want to save it as a CSV file. We use the `to_csv()` method.
- The main arguments are `path_or_buf` (the file name with its path) and `index`. `index=True` (the default) writes the DataFrame's index as a separate column; `index=False` (or `None`) omits it.
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html 

In [13]:
# 2.1.1
input_ls = [[1,2,3],[4,5,6],[7,8,9],[10,11,12],[13,14,15],[16,17,18]]
test_df = pd.DataFrame(input_ls, columns=['var1', 'var2', 'var3'])
print(test_df)
test_df_csv_dir = '{}/data/week_09_test.csv'.format(os.getcwd())
test_df.to_csv(path_or_buf=test_df_csv_dir, index=False)

   var1  var2  var3
0     1     2     3
1     4     5     6
2     7     8     9
3    10    11    12
4    13    14    15
5    16    17    18


<font size='4'>

- Open the API references for `read_csv()` and `to_csv()` linked above.
- Note a small distinction between `pd.read_csv()` and `pd.DataFrame.to_csv()` in the API references:
- The first is a **function**, so it does not rely on an existing DataFrame. The second is a **method**, so it must be called on an existing DataFrame.
    - To import a new data file into your working environment, write `xxx = pd.read_csv()`.
    - To save an existing DataFrame to a CSV file, write `existing_df.to_csv()`.

### 2.2. Outputting a DataFrame into a text file

<font size='4'>

- As with CSV files, we use the `to_csv()` method.
- When saving the output as a `.txt` file, specify the separator using the `sep` argument.

In [14]:
# 2.2.1
test_df_text_dir = '{}/data/week_09_test.txt'.format(os.getcwd())
test_df.to_csv(path_or_buf=test_df_text_dir, index=None, sep=',')
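
A write-then-read round trip shows how `sep` has to agree on both sides. This is a minimal sketch that uses an in-memory buffer instead of a file on disk; `demo_df` and the choice of a tab separator are illustrative assumptions.

```python
import io
import pandas as pd

demo_df = pd.DataFrame({'var1': [1, 4], 'var2': [2, 5]})

# to_csv() is a method: it is called on an existing DataFrame.
# With no path given, it returns the text that would have been written.
tab_text = demo_df.to_csv(index=False, sep='\t')
header_line = tab_text.splitlines()[0]

# read_csv() is a function: it is called from pandas and builds a new DataFrame.
# The same separator must be passed when reading the text back.
restored_df = pd.read_csv(io.StringIO(tab_text), sep='\t')

print(repr(header_line))
print(restored_df.equals(demo_df))
```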

### 2.3. Outputting a DataFrame into an Excel file

<font size='4'>

- To save a DataFrame as an Excel file (`.xlsx`), we use the `to_excel()` method.

In [15]:
# 2.3.1
test_df_excel_dir = '{}/data/week_09_test.xlsx'.format(os.getcwd())
test_df.to_excel(excel_writer=test_df_excel_dir, index=None, sheet_name='week_09_test')

### 2.4. Outputting a DataFrame into a JSON file

<font size='4'>

- Similarly, to save a DataFrame as a `.json` file, we use the `to_json()` method.

In [16]:
# 2.4.1
test_df_json_dir = '{}/data/week_09_test.json'.format(os.getcwd())
test_df.to_json(path_or_buf=test_df_json_dir)
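
To see what `to_json()` actually writes, it can help to inspect the text directly. The sketch below is illustrative only: `demo_df` is a made-up table, and `orient='records'` is shown as one alternative layout among several that `to_json()` supports.

```python
import io
import json
import pandas as pd

demo_df = pd.DataFrame({'var1': [1, 4], 'var2': [2, 5]})

# With no path given, to_json() returns the JSON text instead of writing a file.
# Default layout: each column maps to {index label -> value}.
by_column = json.loads(demo_df.to_json())
# orient='records' stores one dictionary per row.
by_record = json.loads(demo_df.to_json(orient='records'))

# Read the text back; StringIO makes the string behave like a file.
restored_df = pd.read_json(io.StringIO(demo_df.to_json()))

print(by_column)
print(by_record)
print(restored_df)
```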

## 3. View and Understand DataFrames using Pandas

### 3.1. Head and Tail Methods
<font size='4'>

- Similar to functions in R, you can view the first few or last few rows of a DataFrame using the `.head()` or `.tail()` methods, respectively.
- You specify the number of rows through the `n` argument (default value is 5).

In [17]:
# 3.1.1
feature_df.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,vehicle id
0,18.0,8,307,130,3504,12.0,C-1689780
1,15.0,8,350,165,3693,11.5,B-1689791
2,18.0,8,318,150,3436,11.0,P-1689802
3,16.0,8,304,150,3433,12.0,A-1689813
4,17.0,8,302,140,3449,10.5,F-1689824


In [18]:
# 3.1.2
feature_df.tail(n=6)

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,vehicle id
392,27.0,4,151,90,2950,17.3,C-1694092
393,27.0,4,140,86,2790,15.6,F-1694103
394,44.0,4,97,52,2130,24.6,V-1694114
395,32.0,4,135,84,2295,11.6,D-1694125
396,28.0,4,120,79,2625,18.6,F-1694136
397,31.0,4,119,82,2720,19.4,C-1694147


### 3.2. Describe Method

<font size='4'>

- The `.describe()` method returns summary statistics for all numeric columns: count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%), and maximum.
- It gives a quick look at the scale, skew, and range of numeric data.

In [19]:
# 3.2.1
feature_df.describe()

Unnamed: 0,mpg,cylinders,displacement,weight,acceleration
count,398.0,398.0,398.0,398.0,398.0
mean,23.514573,5.454774,193.427136,2970.424623,15.56809
std,7.815984,1.701004,104.268683,846.841774,2.757689
min,9.0,3.0,68.0,1613.0,8.0
25%,17.5,4.0,104.25,2223.75,13.825
50%,23.0,4.0,148.5,2803.5,15.5
75%,29.0,8.0,262.0,3608.0,17.175
max,46.6,8.0,455.0,5140.0,24.8


<font size='4'>

- You can change which percentiles are reported using the `percentiles` argument. It takes a list of values between 0 and 1.

In [20]:
# 3.2.2
feature_df.describe(percentiles=[0.3, 0.5, 0.7])

Unnamed: 0,mpg,cylinders,displacement,weight,acceleration
count,398.0,398.0,398.0,398.0,398.0
mean,23.514573,5.454774,193.427136,2970.424623,15.56809
std,7.815984,1.701004,104.268683,846.841774,2.757689
min,9.0,3.0,68.0,1613.0,8.0
30%,18.0,4.0,112.0,2301.0,14.2
50%,23.0,4.0,148.5,2803.5,15.5
70%,27.49,6.0,250.0,3424.5,16.8
max,46.6,8.0,455.0,5140.0,24.8


<font size='4'>

- You can include or exclude specific data types in the summary output using the `include` and `exclude` arguments.

In [21]:
# 3.2.3
feature_df.describe(include=[int])

Unnamed: 0,cylinders,displacement,weight
count,398.0,398.0,398.0
mean,5.454774,193.427136,2970.424623
std,1.701004,104.268683,846.841774
min,3.0,68.0,1613.0
25%,4.0,104.25,2223.75
50%,4.0,148.5,2803.5
75%,8.0,262.0,3608.0
max,8.0,455.0,5140.0


In [22]:
# 3.2.4
feature_df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
mpg,398.0,23.514573,7.815984,9.0,17.5,23.0,29.0,46.6
cylinders,398.0,5.454774,1.701004,3.0,4.0,4.0,8.0,8.0
displacement,398.0,193.427136,104.268683,68.0,104.25,148.5,262.0,455.0
weight,398.0,2970.424623,846.841774,1613.0,2223.75,2803.5,3608.0,5140.0
acceleration,398.0,15.56809,2.757689,8.0,13.825,15.5,17.175,24.8
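
Because `.describe()` returns a DataFrame, its rows and cells can be accessed like those of any other DataFrame. A minimal sketch on a small made-up table (the column names and values are hypothetical):

```python
import pandas as pd

demo_df = pd.DataFrame({'height': [150.0, 160.0, 170.0, 180.0, 190.0],
                        'visits': [1, 2, 3, 4, 5]})

# Request the 10th, 50th, and 90th percentiles instead of the quartiles.
summary_df = demo_df.describe(percentiles=[0.1, 0.5, 0.9])
row_labels = summary_df.index.tolist()

# A single statistic is one cell of the summary table.
mean_height = summary_df.loc['mean', 'height']

# .T swaps rows and columns, giving one row per variable.
transposed_df = summary_df.T

print(row_labels)
print(mean_height)
print(transposed_df.shape)
```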


<font size='4'>
    
- Pandas Cheatsheet for data wrangling in Python: 
- https://www.datacamp.com/cheat-sheet/pandas-cheat-sheet-data-wrangling-in-python

### 3.3. Info Method

<font size='4'>

- The `.info()` method: A quick way to look at data types, missing values, and data size of a DataFrame.
- Some frequently used parameters: `show_counts`, `memory_usage`, and `verbose`.

In [23]:
# 3.3.1
feature_df.info(show_counts=True, memory_usage=True, verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    int64  
 3   horsepower    397 non-null    object 
 4   weight        398 non-null    int64  
 5   acceleration  398 non-null    float64
 6   vehicle id    398 non-null    object 
dtypes: float64(2), int64(3), object(2)
memory usage: 21.9+ KB


### 3.4. Shape Attribute
<font size='4'>
    
- The number of rows and columns of a DataFrame can be determined using the `.shape` attribute.
- An attribute is a property stored on a Python object. You access it without `()` because it is a value to look up, not a method to call.
- `.shape` returns a tuple `(n_rows, n_columns)`, which can be indexed to get only the number of rows or only the number of columns.

In [24]:
# 3.4.1
feature_df.shape

(398, 7)

In [25]:
# 3.4.2
feature_df.shape[0] # the number of rows only

398

In [26]:
# 3.4.3
feature_df.shape[1] # the number of columns only

7

In [27]:
# 3.4.4
n_row, n_col = feature_df.shape
print(n_row, n_col)

398 7


In [28]:
# 3.4.5
n_row, _ = feature_df.shape # if you do not care about # of column.
for n_iter in range(n_row):
    pass

### 3.5. Get all columns and their column names

<font size='4'>

- The `.columns` attribute of a DataFrame returns the column names as an `Index` object.
- A pandas index holds the labels (addresses) of the rows or columns.
- You can convert it to a list using the `.tolist()` method or the `list()` function.


In [29]:
# 3.5.1
feature_df.columns
column_ls = feature_df.columns.tolist()
print(column_ls)
column_ls = list(feature_df.columns)
print(column_ls)

['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'vehicle id']
['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'vehicle id']


### 3.6. Check for missing values

<font size='4'>

- The `.copy()` method makes a copy of the original DataFrame.
- This ensures that changes to the copy are not reflected in the original DataFrame.
- Using `.loc`, you can overwrite the values at given row labels and column names, here with `None`, which pandas stores as `NaN` in a numeric column. (`NaN` denotes a missing value.)

In [30]:
# 3.6.1
feature_df2 = feature_df.copy()
feature_df2.loc[2:5, 'mpg'] = None # starts from zero, both boundaries are INCLUSIVE.
feature_df2.head(n=5)

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,vehicle id
0,18.0,8,307,130,3504,12.0,C-1689780
1,15.0,8,350,165,3693,11.5,B-1689791
2,,8,318,150,3436,11.0,P-1689802
3,,8,304,150,3433,12.0,A-1689813
4,,8,302,140,3449,10.5,F-1689824


<font size='4'>

- You can check whether each element in a DataFrame is missing using the `.isnull()` method.
- You can combine `.isnull()` and `.sum()` to count the number of missing values in each column.

In [31]:
# 3.6.2
feature_df2.isnull().head(n=6)

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,vehicle id
0,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False
2,True,False,False,False,False,False,False
3,True,False,False,False,False,False,False
4,True,False,False,False,False,False,False
5,True,False,False,False,False,False,False


In [32]:
# 3.6.3
feature_df2.isnull().sum()

mpg             4
cylinders       0
displacement    0
horsepower      1
weight          0
acceleration    0
vehicle id      0
dtype: int64

In [33]:
# 3.6.4
int(feature_df2.isnull().sum().sum())

5
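
The whole missing-value workflow above fits in a few lines. This is a minimal sketch on a small made-up table, confirming that editing the copy leaves the original untouched:

```python
import numpy as np
import pandas as pd

original_df = pd.DataFrame({'mpg': [18.0, 15.0, 18.0, 16.0],
                            'cylinders': [8, 8, 8, 8]})

working_df = original_df.copy()
# Label-based slice: rows 1 and 2, both boundaries inclusive.
working_df.loc[1:2, 'mpg'] = np.nan

missing_per_column = working_df.isnull().sum()     # one count per column
total_missing = int(missing_per_column.sum())      # single overall count
original_missing = int(original_df.isnull().sum().sum())

print(missing_per_column)
print(total_missing, original_missing)
```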

## 4. Sorting, Slicing, and Extracting Data in pandas

### 4.1. Sorting

<font size='4'>

- To sort a DataFrame by a specific column, use the `.sort_values()` method.
- The `inplace` argument controls whether the operation is performed "in place". With `inplace=True`, the original DataFrame is modified directly and nothing is returned; with the default `inplace=False`, a sorted copy is returned and the original is unchanged.

In [34]:
# 4.1.1
feature_df.sort_values(by='mpg', ascending=False, inplace=True)
print(feature_df.head(n=10))

      mpg  cylinders  displacement horsepower  weight  acceleration vehicle id
322  46.6          4            86         65    2110          17.9  M-1693322
329  44.6          4            91         67    1850          13.8  H-1693399
325  44.3          4            90         48    2085          21.7  V-1693355
394  44.0          4            97         52    2130          24.6  V-1694114
326  43.4          4            90         48    2335          23.7  V-1693366
244  43.1          4            90         48    1985          21.5  V-1692464
309  41.5          4            98         76    2144          14.7  V-1693179
330  40.9          4            85          ?    1835          17.3  R-1693410
324  40.8          4            85         65    2110          19.2  D-1693344
247  39.4          4            85         70    2070          18.6  D-1692497


In [35]:
# 4.1.2
feature_df.sort_values(by=['mpg', 'acceleration'], ascending=[False, True], inplace=True)
print(feature_df.head(n=10))

      mpg  cylinders  displacement horsepower  weight  acceleration vehicle id
322  46.6          4            86         65    2110          17.9  M-1693322
329  44.6          4            91         67    1850          13.8  H-1693399
325  44.3          4            90         48    2085          21.7  V-1693355
394  44.0          4            97         52    2130          24.6  V-1694114
326  43.4          4            90         48    2335          23.7  V-1693366
244  43.1          4            90         48    1985          21.5  V-1692464
309  41.5          4            98         76    2144          14.7  V-1693179
330  40.9          4            85          ?    1835          17.3  R-1693410
324  40.8          4            85         65    2110          19.2  D-1693344
247  39.4          4            85         70    2070          18.6  D-1692497
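
The two sorts above print the same first ten rows because `mpg` has few ties among them. The sketch below uses a small made-up table with deliberate ties to show the effect of the second sort key, and contrasts the default behaviour with `inplace=True`.

```python
import pandas as pd

demo_df = pd.DataFrame({'mpg': [30.0, 25.0, 30.0, 25.0],
                        'acceleration': [15.0, 12.0, 11.0, 18.0]})

# Default inplace=False: a sorted copy is returned and demo_df is unchanged.
# Ties in mpg (descending) are broken by acceleration (ascending).
sorted_df = demo_df.sort_values(by=['mpg', 'acceleration'],
                                ascending=[False, True])
sorted_order = sorted_df.index.tolist()
original_order = demo_df.index.tolist()

# inplace=True: demo_df itself is reordered and the call returns None.
returned = demo_df.sort_values(by='mpg', ascending=False, inplace=True)

print(sorted_order, original_order)
print(returned, demo_df['mpg'].tolist())
```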


### 4.2. Resetting the index

<font size='4'>

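- After you filter or sort a DataFrame, the index labels are no longer consecutive or in order. Use `.reset_index()` to renumber them from 0; `drop=True` discards the old index instead of keeping it as a new column.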

In [36]:
# 4.2.1
feature_df.reset_index(drop=True, inplace=True) # reset index and remove old one
print(feature_df.head(n=10))

    mpg  cylinders  displacement horsepower  weight  acceleration vehicle id
0  46.6          4            86         65    2110          17.9  M-1693322
1  44.6          4            91         67    1850          13.8  H-1693399
2  44.3          4            90         48    2085          21.7  V-1693355
3  44.0          4            97         52    2130          24.6  V-1694114
4  43.4          4            90         48    2335          23.7  V-1693366
5  43.1          4            90         48    1985          21.5  V-1692464
6  41.5          4            98         76    2144          14.7  V-1693179
7  40.9          4            85          ?    1835          17.3  R-1693410
8  40.8          4            85         65    2110          19.2  D-1693344
9  39.4          4            85         70    2070          18.6  D-1692497
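
The effect of the `drop` argument is easiest to see side by side. A minimal sketch on a made-up three-row table:

```python
import pandas as pd

demo_df = pd.DataFrame({'mpg': [18.0, 46.6, 27.0]})
sorted_df = demo_df.sort_values(by='mpg', ascending=False)  # index order is now 1, 2, 0

# drop=True: the old index is discarded.
dropped_df = sorted_df.reset_index(drop=True)

# drop=False (the default): the old index is kept as a new column named 'index'.
kept_df = sorted_df.reset_index(drop=False)

print(sorted_df.index.tolist())
print(dropped_df.index.tolist(), dropped_df.columns.tolist())
print(kept_df.columns.tolist(), kept_df['index'].tolist())
```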


### 4.3. Filtering data using conditions

<font size='4'>

- Put a condition inside `[]` to keep only the rows where the condition is `True`.

In [37]:
# 4.3.1
feature_sub_df = feature_df[feature_df['weight']<2000]
feature_sub_df.head(n=3)

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,vehicle id
1,44.6,4,91,67,1850,13.8,H-1693399
5,43.1,4,90,48,1985,21.5,V-1692464
7,40.9,4,85,?,1835,17.3,R-1693410
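
Under the hood, the condition is a boolean Series with one `True`/`False` per row. The sketch below makes that mask explicit and also combines two conditions, which goes slightly beyond the single-condition example above; the table is made up for illustration.

```python
import pandas as pd

demo_df = pd.DataFrame({'weight': [1850, 3504, 1985, 2130],
                        'cylinders': [4, 8, 4, 4]})

# The condition alone is a boolean Series.
light_mask = demo_df['weight'] < 2000
light_df = demo_df[light_mask]

# Combine conditions with & (and) or | (or).
# Each condition needs its own parentheses.
light_or_v8_df = demo_df[(demo_df['weight'] < 2000) | (demo_df['cylinders'] == 8)]

print(light_mask.tolist())
print(light_df.index.tolist())
print(light_or_v8_df.index.tolist())
```

Note that filtered rows keep their original index labels, which is why resetting the index is often the next step.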


### 4.4. Isolating one column using `[ ]`
<font size='4'>

- You can isolate a single column using square brackets `[]` with the column name inside.
- The output is a pandas `Series` object.
- A pandas Series is a one-dimensional array that can hold data of any type, including integers, floats, strings, booleans, and Python objects.
- A DataFrame is composed of Series that act as its columns.

In [38]:
# 4.4.1
feature_df['cylinders']

0      4
1      4
2      4
3      4
4      4
      ..
393    8
394    8
395    8
396    8
397    8
Name: cylinders, Length: 398, dtype: int64

### 4.5. Isolating more than one column using `[[ ]]`

<font size='4'>

- Isolate two or more columns using `[[ ]]`.
- Provide a `list` of column names inside the square brackets to fetch more than one column.
- Here, the square brackets have two functions:
    - The outer square brackets select a subset of the DataFrame.
    - The inner square brackets create a list.
In [39]:
# 4.5.1
feature_sub_df2 = feature_df[['mpg', 'vehicle id']]
print(feature_sub_df2.head(n=3))

    mpg vehicle id
0  46.6  M-1693322
1  44.6  H-1693399
2  44.3  V-1693355
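
The difference between `[ ]` and `[[ ]]` shows up in the type of the result. A minimal sketch on a made-up table (the vehicle ids are placeholders):

```python
import pandas as pd

demo_df = pd.DataFrame({'mpg': [46.6, 44.6],
                        'cylinders': [4, 4],
                        'vehicle id': ['M-1', 'H-1']})

one_column = demo_df['mpg']             # single brackets: a Series
one_column_df = demo_df[['mpg']]        # list with one name: a one-column DataFrame
two_columns_df = demo_df[['mpg', 'vehicle id']]

print(type(one_column).__name__, one_column.shape)
print(type(one_column_df).__name__, one_column_df.shape)
print(two_columns_df.columns.tolist())
```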


### 4.6. Isolating one row using `[ ]`

<font size='4'>

- We have talked about subsetting columns. What about subsetting rows?
- A single row can be fetched by passing in a boolean series with one `True` value.
- For example, let's select the second row `index=1`.

In [40]:
# 4.6.1
feature_df[feature_df.index==1]

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,vehicle id
1,44.6,4,91,67,1850,13.8,H-1693399


### 4.7. Isolating two or more rows using `[ ]`

<font size='4'>

- Similarly, use `[ ]` to isolate two or more rows, with the `.isin()` method instead of the `==` operator.

In [41]:
# 4.7.1
feature_df[feature_df.index.isin(range(2,10))]
# Note that range(2, 10) includes 2 but excludes 10.

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,vehicle id
2,44.3,4,90,48,2085,21.7,V-1693355
3,44.0,4,97,52,2130,24.6,V-1694114
4,43.4,4,90,48,2335,23.7,V-1693366
5,43.1,4,90,48,1985,21.5,V-1692464
6,41.5,4,98,76,2144,14.7,V-1693179
7,40.9,4,85,?,1835,17.3,R-1693410
8,40.8,4,85,65,2110,19.2,D-1693344
9,39.4,4,85,70,2070,18.6,D-1692497
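
Both row selections above rely on a boolean mask built from the index. The sketch below prints the masks themselves and uses `.isin()` with a list of non-adjacent labels; the table is made up for illustration.

```python
import pandas as pd

demo_df = pd.DataFrame({'mpg': [46.6, 44.6, 44.3, 44.0, 43.4]})

# == compares every index label with a single value.
single_mask = demo_df.index == 1

# .isin() compares every index label with a collection of values.
multi_mask = demo_df.index.isin([0, 3, 4])

print(single_mask.tolist())
print(multi_mask.tolist())
print(demo_df[multi_mask].index.tolist())
```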


### 4.8. Use `.loc[]` and `.iloc[]`

<font size='4'>

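- Use `.loc[]` and `.iloc[]` to fetch rows.
- `.loc[]` uses a label to point to a row, column, or cell. A label slice such as `1:4` includes both endpoints.
- `.iloc[]` uses the numeric position, which always starts from 0. A position slice such as `:4` excludes the stop position.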

In [42]:
# 4.8.1
feature_df2.shape

(398, 7)

In [43]:
# 4.8.2
feature_df2.index = np.arange(1, 399, 1)

In [44]:
# 4.8.3
feature_df2.head(n=5)

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,vehicle id
1,18.0,8,307,130,3504,12.0,C-1689780
2,15.0,8,350,165,3693,11.5,B-1689791
3,,8,318,150,3436,11.0,P-1689802
4,,8,304,150,3433,12.0,A-1689813
5,,8,302,140,3449,10.5,F-1689824


In [45]:
# 4.8.4
feature_df2.loc[1]
# it returns a Series object

mpg                  18.0
cylinders               8
displacement          307
horsepower            130
weight               3504
acceleration         12.0
vehicle id      C-1689780
Name: 1, dtype: object

In [46]:
# 4.8.5
feature_df2.iloc[0]
# Positions always start from 0, regardless of the index labels.

mpg                  18.0
cylinders               8
displacement          307
horsepower            130
weight               3504
acceleration         12.0
vehicle id      C-1689780
Name: 1, dtype: object

In [47]:
# 4.8.6
print(feature_df2.head(n=5))
feature_df2.loc[1:4]

    mpg  cylinders  displacement horsepower  weight  acceleration vehicle id
1  18.0          8           307        130    3504          12.0  C-1689780
2  15.0          8           350        165    3693          11.5  B-1689791
3   NaN          8           318        150    3436          11.0  P-1689802
4   NaN          8           304        150    3433          12.0  A-1689813
5   NaN          8           302        140    3449          10.5  F-1689824


Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,vehicle id
1,18.0,8,307,130,3504,12.0,C-1689780
2,15.0,8,350,165,3693,11.5,B-1689791
3,,8,318,150,3436,11.0,P-1689802
4,,8,304,150,3433,12.0,A-1689813


In [48]:
# 4.8.7
feature_df2.iloc[:4]

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,vehicle id
1,18.0,8,307,130,3504,12.0,C-1689780
2,15.0,8,350,165,3693,11.5,B-1689791
3,,8,318,150,3436,11.0,P-1689802
4,,8,304,150,3433,12.0,A-1689813
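
The label-versus-position distinction is easier to see when the labels look nothing like positions. The sketch below uses a made-up table whose index labels are 10, 20, 30, and 40.

```python
import pandas as pd

demo_df = pd.DataFrame({'mpg': [18.0, 15.0, 16.0, 17.0]},
                       index=[10, 20, 30, 40])

# .loc slices by label and includes both endpoints.
by_label_df = demo_df.loc[10:30]

# .iloc slices by position and excludes the stop position.
by_position_df = demo_df.iloc[0:3]
one_row_df = demo_df.iloc[1:2]

# The same cell addressed by label and by position.
cell_by_label = demo_df.loc[20, 'mpg']
cell_by_position = demo_df.iloc[1, 0]

print(by_label_df.index.tolist())
print(by_position_df.index.tolist())
print(one_row_df.index.tolist())
print(cell_by_label, cell_by_position)
```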


<font size='4'>

- You can subset using a list instead of a range.

In [49]:
# 4.8.8
feature_df2.loc[[1,3,5]]
# print(feature_df2.head(n=5))

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,vehicle id
1,18.0,8,307,130,3504,12.0,C-1689780
3,,8,318,150,3436,11.0,P-1689802
5,,8,302,140,3449,10.5,F-1689824


In [50]:
# 4.8.9
feature_df2.iloc[[0,2,4]]

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,vehicle id
1,18.0,8,307,130,3504,12.0,C-1689780
3,,8,318,150,3436,11.0,P-1689802
5,,8,302,140,3449,10.5,F-1689824


<font size='4'>

- You can also select specific columns along with rows.
- `.loc[]` takes labels for both rows and columns, while `.iloc[]` takes integer positions for both.
    - You can pass either a list or a NumPy array. For a NumPy array, make sure all elements are integers.

In [51]:
# 4.8.10
feature_df2.loc[1:5, ['mpg', 'vehicle id']]

Unnamed: 0,mpg,vehicle id
1,18.0,C-1689780
2,15.0,B-1689791
3,,P-1689802
4,,A-1689813
5,,F-1689824


In [52]:
# 4.8.11
feature_df2.iloc[:5, [0, 6]]

Unnamed: 0,mpg,vehicle id
1,18.0,C-1689780
2,15.0,B-1689791
3,,P-1689802
4,,A-1689813
5,,F-1689824


In [53]:
# 4.8.12
feature_df2.iloc[:4, np.array([0, 6])]

Unnamed: 0,mpg,vehicle id
1,18.0,C-1689780
2,15.0,B-1689791
3,,P-1689802
4,,A-1689813


<font size='4'>

- You can update/modify certain values by using the assignment operator `=`.

In [54]:
# 4.8.13
feature_df2.head(n=5)
# We want to change the third mpg from NaN to 16.0.

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,vehicle id
1,18.0,8,307,130,3504,12.0,C-1689780
2,15.0,8,350,165,3693,11.5,B-1689791
3,,8,318,150,3436,11.0,P-1689802
4,,8,304,150,3433,12.0,A-1689813
5,,8,302,140,3449,10.5,F-1689824


In [55]:
# 4.8.14
feature_df2.loc[3, ['mpg']] = 16.0
feature_df2.head(n=5)

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,vehicle id
1,18.0,8,307,130,3504,12.0,C-1689780
2,15.0,8,350,165,3693,11.5,B-1689791
3,16.0,8,318,150,3436,11.0,P-1689802
4,,8,304,150,3433,12.0,A-1689813
5,,8,302,140,3449,10.5,F-1689824


In [56]:
# 4.8.15
feature_df2.loc[3, ['mpg']] = None
feature_df2.head(n=5)

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,vehicle id
1,18.0,8,307,130,3504,12.0,C-1689780
2,15.0,8,350,165,3693,11.5,B-1689791
3,,8,318,150,3436,11.0,P-1689802
4,,8,304,150,3433,12.0,A-1689813
5,,8,302,140,3449,10.5,F-1689824


In [57]:
# 4.8.16
# Or we can use iloc[], make sure all inputs are integers (starting from 0).
feature_df2.iloc[2, 0] = 16.0
feature_df2.head(n=5)

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,vehicle id
1,18.0,8,307,130,3504,12.0,C-1689780
2,15.0,8,350,165,3693,11.5,B-1689791
3,16.0,8,318,150,3436,11.0,P-1689802
4,,8,304,150,3433,12.0,A-1689813
5,,8,302,140,3449,10.5,F-1689824


### 4.9. Conditional slicing

<font size='4'>

- For example, we want to find the rows where **cylinders** equals 6.
- We isolate rows using the square brackets `[]` and use the equality operator `==` to identify the rows where cylinders equals 6.

In [58]:
# 4.9.1
feature_df2[feature_df2.cylinders==6]

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,vehicle id
16,22.0,6,198,95,2833,15.5,P-1689945
17,18.0,6,199,97,2774,15.5,A-1689956
18,21.0,6,200,85,2587,16.0,F-1689967
25,21.0,6,199,90,2648,15.0,A-1690044
34,19.0,6,232,100,2634,13.0,A-1690143
...,...,...,...,...,...,...,...
366,20.2,6,200,88,3060,17.1,F-1693795
367,17.6,6,225,85,3465,16.6,C-1693806
387,25.0,6,181,110,2945,16.4,B-1694026
388,38.0,6,262,85,3015,17.0,O-1694037


In [59]:
# 4.9.2
feature_df2.loc[feature_df2.mpg>20, ['mpg', 'cylinders', 'vehicle id']]

Unnamed: 0,mpg,cylinders,vehicle id
15,24.0,4,T-1689934
16,22.0,6,P-1689945
18,21.0,6,F-1689967
19,27.0,4,D-1689978
20,26.0,4,V-1689989
...,...,...,...
394,27.0,4,F-1694103
395,44.0,4,V-1694114
396,32.0,4,D-1694125
397,28.0,4,F-1694136
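
The two conditional slices above can be reproduced on a small made-up table. The sketch also points out one practical detail: dot access such as `df.cylinders` only works for column names that are valid Python identifiers, so a name with a space, like `vehicle id`, needs brackets.

```python
import pandas as pd

demo_df = pd.DataFrame({'mpg': [18.0, 24.0, 22.0, 15.0],
                        'cylinders': [8, 4, 6, 8],
                        'vehicle id': ['C-1', 'T-2', 'P-3', 'B-4']})  # placeholder ids

# Rows only: boolean mask inside [].
six_cyl_df = demo_df[demo_df.cylinders == 6]

# Rows and columns together: mask first, list of column names second.
efficient_df = demo_df.loc[demo_df['mpg'] > 20, ['mpg', 'vehicle id']]

# Bracket access is required for a column name that contains a space.
efficient_ids = efficient_df['vehicle id'].tolist()

print(six_cyl_df.index.tolist())
print(efficient_df.index.tolist(), efficient_df.columns.tolist())
print(efficient_ids)
```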


## 5. Debugging in PyCharm

<font size='4'>

https://www.jetbrains.com/help/pycharm/debugging-your-first-python-application.html#summary 