# Week-09: Tutorial on Pandas

<font size='4'>

This week, we start to revisit `pandas` in detail.

Pandas is used to:
- Import datasets from databases, spreadsheets, comma-separated values (CSV) files, etc.
- Clean datasets, e.g., by handling missing values.
- Tidy datasets by reshaping the structure into a suitable format prior to analysis.
- Aggregate data by calculating summary statistics.
- Visualize datasets and uncover hidden patterns.

## 0. Import packages

<font size='4'>

You should be pretty familiar with importing packages.

In [1]:
# 0.1


## 1. Import datasets/files to Pandas

### 1.1. Import comma-separated values (CSV) file

<font size='4'>
    
- Use `pd.read_csv()` with the path to the CSV file.
- The resulting object is a pandas `DataFrame`, which we name `feature_df`.
- https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
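
A minimal sketch of the call described above. The CSV content and column names are made up for illustration, and `io.StringIO` stands in for a file path so the example runs without a file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice you would pass a path such as "data/features.csv".
csv_text = "subject,age,score\nA,23,0.5\nB,31,0.7\nC,27,0.9\n"

# pd.read_csv() accepts a path or any file-like object.
feature_df = pd.read_csv(io.StringIO(csv_text))

print(type(feature_df))
print(feature_df)
```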

In [2]:
# 1.1.1


In [3]:
# 1.1.2


### 1.2. Import text files

<font size='4'>

- Reading text files is similar to reading CSV files.
- You use the same `pd.read_csv()` function.
- The only difference is that you need to specify the separator with the `sep` parameter (argument).
- The separator is the character (or regular expression) that separates the fields, i.e., the columns, within each line of the file.
- Common separators include
    - Comma (`sep=','`),
    - Single whitespace (`sep='\s'`),
    - Multiple whitespace (`sep='\s+'`),
    - Tab (`sep='\t'`),
    - Colon (`sep=':'`) 
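
To illustrate the `sep` argument, the sketch below parses the same made-up table twice: once tab-separated and once separated by runs of spaces. The data and variable names are assumptions for illustration only.

```python
import io

import pandas as pd

# The same hypothetical table in two plain-text layouts.
tab_text = "name\tmpg\tcylinders\nford\t18.0\t8\nhonda\t33.0\t4\n"
space_text = "name   mpg  cylinders\nford   18.0   8\nhonda  33.0   4\n"

# Tab-separated fields.
tab_df = pd.read_csv(io.StringIO(tab_text), sep="\t")

# One or more whitespace characters between fields (a regular expression).
space_df = pd.read_csv(io.StringIO(space_text), sep=r"\s+")

print(tab_df)
print(space_df)
```

Both calls should produce the same three columns, because only the separator differs between the two inputs.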

In [4]:
# 1.2.1


In [5]:
# 1.2.2


In [6]:
# 1.2.3


### 1.3. Import Excel files (single sheet)

<font size='4'>

- For Excel files (`.xls` and `.xlsx`), use the `pd.read_excel()` function with the file path.
- You can specify the `header`. It has a default value of `0`, which uses the first row as the headers (column names).
- You can also specify column names as a list in the `names` argument.
- The `index_col` argument (default `None`) can be used if the file contains a column that should serve as the row index.
    - In a pandas DataFrame or Series, the index is a set of labels that identify each row (or column).
    - You can access a specific row or column by using its index label.
    - We will learn more about it later.

In [7]:
# 1.3.1


### 1.4. Import Excel files (multiple sheets)

<font size='4'>

- As you may have noticed, `ptsd_df` above is an empty DataFrame. That is because I manually created an empty sheet as the first tab.
- To read a particular sheet of the Excel file, specify the `sheet_name` argument. You can pass either the sheet name (as a string) or an integer for the sheet position.
- Note that Python uses `0`-based indexing, so the first sheet is `sheet_name=0`.

In [8]:
# 1.4.1


In [9]:
# 1.4.2


### 1.5. Import JSON file

<font size='4'>

- As with `.csv` files, you use the `pd.read_json()` function for JSON files.
- A handy trick: use the `*` wildcard with the `glob.glob()` function to quickly find files in a directory.
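
A small sketch of the `glob.glob()` trick combined with `pd.read_json()`. The file name, its contents, and the temporary directory are assumptions made so the example is self-contained; with real data you would point the pattern at your own folder.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write a small hypothetical JSON file so there is something to find.
    json_path = os.path.join(tmp_dir, "example_records.json")
    with open(json_path, "w") as f:
        f.write('[{"id": 1, "label": "a"}, {"id": 2, "label": "b"}]')

    # "*" matches any file name, so this lists every .json file in the folder.
    matches = glob.glob(os.path.join(tmp_dir, "*.json"))

    # Read the first match while the temporary folder still exists.
    json_df = pd.read_json(matches[0])

print(json_df)
```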

In [10]:
# 1.5.1


## 2. Outputting data in pandas

### 2.1. Outputting a DataFrame into a CSV file

<font size='4'>

- Suppose that we have created a DataFrame `test_df` and want to save it as a CSV file. We use the `to_csv()` method.
- The main arguments are `path_or_buf` (the file name with its path) and `index`. With `index=True` (the default), the DataFrame's index is written as a separate column; with `index=False`, it is omitted.
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html 
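
The effect of `index` is easiest to see side by side. In this sketch `test_df` holds made-up values, and `path_or_buf` is left out so that `to_csv()` returns the CSV text as a string instead of writing a file.

```python
import pandas as pd

# Hypothetical DataFrame used only for illustration.
test_df = pd.DataFrame({"name": ["a", "b"], "value": [1, 2]})

# Without a path, to_csv() returns the CSV content as a string.
with_index = test_df.to_csv()               # index=True is the default
without_index = test_df.to_csv(index=False)

print(with_index)
print(without_index)
```

With the index included, the header line starts with an empty field because the index column has no name.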

In [11]:
# 2.1.1


<font size='4'>

- Open the API references for `read_csv()` and `to_csv()` linked above.
- Note a small distinction between `pd.read_csv()` and `pd.DataFrame.to_csv()` in their API references:
- The first is a **function**, so it does not rely on an existing DataFrame; the second is a **method**, so it must be called on an existing DataFrame.
    - For example, to import a new data file into your working environment, you simply write `xxx = pd.read_csv()`.
    - To save an existing DataFrame to a CSV file, you write `existing_df.to_csv()`.
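
The function/method distinction can be seen in a short round trip. The DataFrame contents, the file name `example.csv`, and the temporary directory are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical existing DataFrame.
existing_df = pd.DataFrame({"name": ["a", "b"], "value": [1, 2]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "example.csv")

    # Method: called on an existing DataFrame object.
    existing_df.to_csv(csv_path, index=False)

    # Function: called on the pandas module, creates a new DataFrame.
    new_df = pd.read_csv(csv_path)

print(new_df)
```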

### 2.2. Outputting a DataFrame into a text file

<font size='4'>

- As with CSV files, we use the `to_csv()` method.
- When saving the output as a `.txt` file, you specify the separator using the `sep` argument.
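
A brief sketch of writing tab-separated text. The data are made up, and the result is returned as a string here; passing a path such as `"output.txt"` (a hypothetical name) as the first argument would write the same content to a file.

```python
import pandas as pd

# Hypothetical DataFrame used only for illustration.
test_df = pd.DataFrame({"name": ["a", "b"], "value": [1, 2]})

# Use a tab instead of a comma between fields.
tab_text = test_df.to_csv(sep="\t", index=False)

print(tab_text)
```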

In [12]:
# 2.2.1


### 2.3. Outputting a DataFrame into an Excel file

<font size='4'>

- To save a DataFrame as an Excel file (e.g., `.xlsx`), we use the `to_excel()` method.

In [13]:
# 2.3.1


### 2.4. Outputting a DataFrame into a JSON file

<font size='4'>

- To save a DataFrame as a `.json` file, we use the `to_json()` method.
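
A minimal sketch of `to_json()`. The data are made up, and `orient="records"` is just one of several layouts the method supports; it is chosen here because it produces one JSON object per row.

```python
import json

import pandas as pd

# Hypothetical DataFrame used only for illustration.
test_df = pd.DataFrame({"name": ["a", "b"], "value": [1, 2]})

# Without a path, to_json() returns the JSON text as a string.
json_text = test_df.to_json(orient="records")

# Parse it back with the standard library to inspect the structure.
records = json.loads(json_text)
print(records)
```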

In [14]:
# 2.4.1


## 3. View and Understand DataFrames using Pandas

### 3.1. Head and Tail Methods
<font size='4'>

- Similar to functions in R, you can view the first few or last few rows of a DataFrame using the `.head()` or `.tail()` methods, respectively.
- You specify the number of rows through the `n` argument (the default value is 5).
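
For instance, with a made-up ten-row DataFrame (the column names are assumptions):

```python
import pandas as pd

# Hypothetical DataFrame with ten rows.
df = pd.DataFrame({"row_id": range(10), "square": [i**2 for i in range(10)]})

first_rows = df.head()      # first 5 rows by default
last_rows = df.tail(n=3)    # last 3 rows

print(first_rows)
print(last_rows)
```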

In [15]:
# 3.1.1


In [16]:
# 3.1.2


### 3.2. Describe Method

<font size='4'>

- The `.describe()` method returns summary statistics for all numeric columns: count, mean, standard deviation, minimum, the 25th/50th/75th percentiles, and maximum.
- It gives a quick look at the scale, spread, and range of numeric data.
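
A short sketch on made-up data (the column names echo a car dataset but the values are invented). Note that the text column is skipped by default.

```python
import pandas as pd

# Hypothetical data: two numeric columns and one text column.
df = pd.DataFrame({
    "mpg": [18.0, 15.0, 33.0, 24.0],
    "cylinders": [8, 8, 4, 6],
    "name": ["ford", "chevy", "honda", "toyota"],
})

summary = df.describe()
print(summary)
```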

In [17]:
# 3.2.1


<font size='4'>

- You can change which percentiles are reported using the `percentiles` argument. It takes a list of values between 0 and 1.
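
For example, to report the 10th, 50th, and 90th percentiles instead of the quartiles (made-up data):

```python
import pandas as pd

# Hypothetical numeric data.
df = pd.DataFrame({"mpg": [18.0, 15.0, 33.0, 24.0, 27.0]})

# Values are fractions between 0 and 1, not percentages.
summary = df.describe(percentiles=[0.1, 0.5, 0.9])
print(summary)
```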

In [18]:
# 3.2.2


<font size='4'>

- You can include or exclude specific data types in the summary output using the `include` and `exclude` arguments.
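
A sketch of `include` and `exclude` on made-up data with one text column:

```python
import pandas as pd

# Hypothetical data: two numeric columns and one text column.
df = pd.DataFrame({
    "mpg": [18.0, 15.0, 33.0, 24.0],
    "cylinders": [8, 8, 4, 6],
    "name": ["ford", "chevy", "honda", "toyota"],
})

numeric_summary = df.describe(include=["number"])   # numeric columns only
text_summary = df.describe(exclude=["number"])      # everything except numeric
all_summary = df.describe(include="all")            # every column

print(all_summary)
```

For non-numeric columns the summary reports `count`, `unique`, `top`, and `freq` instead of the numeric statistics.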

In [19]:
# 3.2.3


In [20]:
# 3.2.4


<font size='4'>
    
- Pandas Cheatsheet for data wrangling in Python: 
- https://www.datacamp.com/cheat-sheet/pandas-cheat-sheet-data-wrangling-in-python

### 3.3. Info Method

<font size='4'>

- The `.info()` method is a quick way to look at the data types, non-null counts (and hence missing values), and memory usage of a DataFrame.
- Some frequently used parameters: `show_counts`, `memory_usage`, and `verbose`.
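
A sketch with made-up data containing one missing value. `.info()` prints its report rather than returning it, so the second call passes a buffer through the `buf` argument to capture the text.

```python
import io

import pandas as pd

# Hypothetical data with one missing mpg value.
df = pd.DataFrame({
    "mpg": [18.0, None, 33.0],
    "name": ["ford", "honda", "toyota"],
})

# Print the report directly, using the parameters mentioned above.
df.info(show_counts=True, memory_usage="deep", verbose=True)

# Capture the same kind of report as a string.
buffer = io.StringIO()
df.info(buf=buffer)
info_text = buffer.getvalue()
```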

In [21]:
# 3.3.1


### 3.4. Shape Attribute
<font size='4'>
    
- The number of rows and columns of a DataFrame can be determined using the `.shape` attribute.
- An attribute is a property of a specific Python object. You access it without `()` because it is a stored value rather than a method call.
- It returns a tuple `(rows, columns)`, which can be indexed to obtain only the number of rows or only the number of columns.
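
For example, on a made-up DataFrame with four rows and three columns:

```python
import pandas as pd

# Hypothetical DataFrame: 4 rows, 3 columns.
df = pd.DataFrame({
    "mpg": [18.0, 15.0, 33.0, 24.0],
    "cylinders": [8, 8, 4, 6],
    "name": ["ford", "chevy", "honda", "toyota"],
})

shape = df.shape        # attribute, so no parentheses
n_rows = df.shape[0]    # first element: number of rows
n_cols = df.shape[1]    # second element: number of columns

print(shape, n_rows, n_cols)
```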

In [22]:
# 3.4.1


In [23]:
# 3.4.2


In [24]:
# 3.4.3


In [25]:
# 3.4.4


In [26]:
# 3.4.5


### 3.5. Get all columns and their column names

<font size='4'>

- The `.columns` attribute of a DataFrame object returns the column names in the form of an `Index` object.
- A pandas index holds the labels (addresses) of the rows or columns.
- You previously converted it to a list using the `list()` function.
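
A short sketch with made-up column names:

```python
import pandas as pd

# Hypothetical DataFrame.
df = pd.DataFrame({"mpg": [18.0, 33.0], "cylinders": [8, 4], "name": ["ford", "honda"]})

column_index = df.columns        # a pandas Index object
column_names = list(df.columns)  # a plain Python list

print(column_index)
print(column_names)
```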


In [27]:
# 3.5.1


### 3.6. Check for missing values

<font size='4'>

- The `.copy()` method makes a copy of the original DataFrame.
- This is done to ensure that any changes to the copy do not reflect in the original DataFrame.
- Using `.loc`, you can set the values at given row and column labels, e.g., to `NaN`. (`NaN` denotes a missing value.)
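
The sketch below copies a made-up DataFrame and sets one cell of the copy to `NaN`, leaving the original untouched. The names `df` and `df_missing` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical original data.
df = pd.DataFrame({"mpg": [18.0, 15.0, 33.0], "cylinders": [8, 8, 4]})

# Work on an independent copy.
df_missing = df.copy()

# Set the cell at row label 1, column "mpg" to a missing value.
df_missing.loc[1, "mpg"] = np.nan

print(df_missing)
print(df)  # the original still has 15.0 in that cell
```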

In [28]:
# 3.6.1


<font size='4'>

- You can check whether each element in a DataFrame is missing using the `.isnull()` method.
- You can combine `.isnull()` and `.sum()` to count the number of nulls in each column.
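
A sketch on made-up data with two missing cells:

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in two columns.
df_missing = pd.DataFrame({
    "mpg": [18.0, np.nan, 33.0],
    "cylinders": [8, 8, np.nan],
    "name": ["ford", "honda", "toyota"],
})

# True where a value is missing, False otherwise.
missing_mask = df_missing.isnull()

# True counts as 1, so summing each column counts its missing values.
missing_per_column = df_missing.isnull().sum()

# Summing again gives the total across the whole DataFrame.
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(total_missing)
```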

In [29]:
# 3.6.2


In [30]:
# 3.6.3


In [31]:
# 3.6.4


## 4. Sorting, Slicing, and Extracting Data in pandas

### 4.1. Sorting

<font size='4'>

- To sort a DataFrame by a specific column, use the `.sort_values()` method.
- The `inplace` argument controls whether the operation is performed "in place", i.e., whether it modifies the original object directly instead of returning a sorted copy.
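
A sketch contrasting the default behavior with `inplace=True`, using made-up data:

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({"name": ["ford", "honda", "toyota"], "mpg": [18.0, 33.0, 24.0]})

# Default: returns a new sorted DataFrame and leaves df unchanged.
sorted_df = df.sort_values(by="mpg", ascending=False)

# inplace=True: modifies the object itself and returns None.
df_inplace = df.copy()
result = df_inplace.sort_values(by="mpg", inplace=True)

print(sorted_df)
print(df_inplace)
```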

In [32]:
# 4.1.1


In [33]:
# 4.1.2


### 4.2. Resetting the index

<font size='4'>

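- If you filter or sort a DataFrame, the index labels may no longer be consecutive or in order. Use `.reset_index()` to renumber them from 0 (pass `drop=True` to discard the old index instead of keeping it as a column).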
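
For example, after sorting a made-up DataFrame the original labels travel with their rows, and `.reset_index(drop=True)` renumbers them:

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({"name": ["ford", "honda", "toyota"], "mpg": [18.0, 33.0, 24.0]})

# Sorting keeps each row's original index label.
sorted_df = df.sort_values(by="mpg")

# Renumber from 0; drop=True discards the old labels.
reset_df = sorted_df.reset_index(drop=True)

print(sorted_df)
print(reset_df)
```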

In [34]:
# 4.2.1


### 4.3. Filtering data using conditions

<font size='4'>

- Use `[]` with a boolean condition to keep only the rows that satisfy it.
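
A sketch with made-up data. When combining conditions, wrap each one in parentheses and use `&` (and) or `|` (or).

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({
    "name": ["ford", "honda", "toyota", "chevy"],
    "mpg": [18.0, 33.0, 24.0, 15.0],
    "cylinders": [8, 4, 6, 8],
})

# One condition.
efficient = df[df["mpg"] > 20]

# Two conditions combined with &.
efficient_small = df[(df["mpg"] > 20) & (df["cylinders"] < 6)]

print(efficient)
print(efficient_small)
```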

In [35]:
# 4.3.1


### 4.4. Isolating one column using `[ ]`
<font size='4'>

- You can isolate a single column using a square bracket `[]` with a column name in it.
- The output is a pandas `Series` object.
- A pandas Series is a one-dimensional array containing data of any type, including integer, float, string, boolean, python objects, etc.
- A DataFrame is composed of multiple Series that act as its columns.
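
For example, with a made-up DataFrame:

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({"name": ["ford", "honda"], "mpg": [18.0, 33.0]})

# Single brackets with one column name return a Series.
mpg_series = df["mpg"]

print(type(mpg_series))
print(mpg_series)
```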

In [36]:
# 4.4.1


### 4.5. Isolating two or more columns using `[[ ]]`

<font size='4'>

- You can provide a `list` of column names inside the square brackets to fetch more than one column.
- The output is a pandas `DataFrame` object.
- Here, the square brackets have two functions:
    - The outer square brackets indicate a subset of the DataFrame.
    - The inner square brackets create a list.
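
A sketch with made-up data. Note that double brackets return a DataFrame even when the list contains a single column name.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({
    "name": ["ford", "honda"],
    "mpg": [18.0, 33.0],
    "cylinders": [8, 4],
})

# The inner brackets build the list ["name", "mpg"]; the outer brackets subset df.
two_columns = df[["name", "mpg"]]

# A one-element list still gives a DataFrame, not a Series.
one_column_df = df[["mpg"]]

print(two_columns)
print(type(one_column_df))
```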

In [37]:
# 4.5.1


### 4.6. Isolating one row using `[ ]`

<font size='4'>

- We have talked about subsetting columns. What about subsetting rows?
- A single row can be fetched by passing in a boolean series with one `True` value.
- For example, let's select the second row (the row whose index equals `1`).
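
A sketch with made-up data. Comparing the index with `==` yields one boolean per row, and `[]` keeps the rows marked `True`.

```python
import pandas as pd

# Hypothetical data with the default index 0, 1, 2.
df = pd.DataFrame({"name": ["ford", "honda", "toyota"], "mpg": [18.0, 33.0, 24.0]})

# One boolean per row; only the row with index label 1 is True.
mask = df.index == 1

second_row = df[mask]

print(mask)
print(second_row)
```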

In [38]:
# 4.6.1


### 4.7. Isolating two or more rows using `[ ]`

<font size='4'>

- Similarly, we use `[ ]` to isolate two or more rows, with the `.isin()` method in place of the `==` operator.
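
A sketch on a made-up twelve-row DataFrame, selecting the rows whose index labels are 2 through 9:

```python
import pandas as pd

# Hypothetical data with the default index 0..11.
df = pd.DataFrame({"value": range(100, 112)})

# .isin() marks every row whose index label appears in the given collection.
mask = df.index.isin(range(2, 10))

selected_rows = df[mask]

print(selected_rows)
```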

In [39]:
# 4.7.1

# Note that range(2, 10) includes the lower bound (2) but excludes the upper bound (10).

### 4.8. Use `.loc[]` and `.iloc[]`

<font size='4'>

- Use `.loc[]` and `.iloc[]` to fetch rows.
- `.loc[]` uses a label to point to a row, column, or cell.
- `.iloc[]` uses the integer position.
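
The difference shows up most clearly when the index labels are not 0, 1, 2, and so on. The sketch below uses made-up data with labels 10 to 40.

```python
import pandas as pd

# Hypothetical data with non-default index labels.
df = pd.DataFrame(
    {"name": ["ford", "honda", "toyota", "chevy"], "mpg": [18.0, 33.0, 24.0, 15.0]},
    index=[10, 20, 30, 40],
)

by_label = df.loc[20]       # row whose label is 20 (returned as a Series)
by_position = df.iloc[1]    # second row by position: the same row

label_slice = df.loc[10:30]     # label slices include the end label
position_slice = df.iloc[0:2]   # position slices exclude the end position

print(by_label)
print(label_slice)
print(position_slice)
```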

In [40]:
# 4.8.1


In [41]:
# 4.8.2


In [42]:
# 4.8.3


In [43]:
# 4.8.4

# it returns a Series object

In [44]:
# 4.8.5

# iloc always counts from 0; it uses the absolute integer position.

In [45]:
# 4.8.6


In [46]:
# 4.8.7


<font size='4'>

- You can subset using a list instead of a range.

In [47]:
# 4.8.8


In [48]:
# 4.8.9


<font size='4'>

- You can also select specific columns along with rows.
- `.loc[]` requires labels for both rows and columns, while `.iloc[]` requires integer positions for both.
    - You can use either a list or a NumPy array. For a NumPy array with `.iloc[]`, make sure all values are integers.
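
A sketch selecting the same rows and columns three ways, on made-up data with the default integer index (so row labels and row positions happen to coincide):

```python
import numpy as np
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({
    "name": ["ford", "honda", "toyota"],
    "mpg": [18.0, 33.0, 24.0],
    "cylinders": [8, 4, 6],
})

# Labels for rows and columns.
by_label = df.loc[[0, 2], ["name", "mpg"]]

# Integer positions for rows and columns.
by_position = df.iloc[[0, 2], [0, 1]]

# Integer NumPy arrays work as well.
by_array = df.iloc[np.array([0, 2]), np.array([0, 1])]

print(by_label)
```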

In [49]:
# 4.8.10


In [50]:
# 4.8.11


In [51]:
# 4.8.12


<font size='4'>

- You can update/modify certain values by using the assignment operator `=`.
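
A sketch of assignment through `.loc[]` and `.iloc[]`. The data are made up, with the third `mpg` value missing.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the third mpg value is missing.
df = pd.DataFrame({"mpg": [18.0, 15.0, np.nan, 24.0], "cylinders": [8, 8, 6, 4]})

# By label: row label 2, column "mpg".
df.loc[2, "mpg"] = 16.0

# By position: fourth row (position 3), first column (position 0).
df.iloc[3, 0] = 25.0

print(df)
```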

In [52]:
# 4.8.13

# We want to change the third mpg from NaN to 16.0.

In [53]:
# 4.8.14


In [54]:
# 4.8.15


In [55]:
# 4.8.16
# Or we can use iloc[]; make sure all inputs are integer positions (starting from 0).


### 4.9. Conditional slicing

<font size='4'>

- For example, we want to find the rows where **cylinders** equals 6.
- We isolate rows using square brackets `[]` and use the equality operator `==` to identify the rows where cylinders is 6.
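
A sketch on made-up data. The second selection is an optional variation that uses `.loc[]` to pick a single column for the matching rows.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({
    "name": ["ford", "honda", "toyota", "dodge"],
    "mpg": [18.0, 33.0, 24.0, 21.0],
    "cylinders": [8, 4, 6, 6],
})

# Keep only the rows where cylinders equals 6.
six_cyl = df[df["cylinders"] == 6]

# Same condition, but return only the "name" column.
six_cyl_names = df.loc[df["cylinders"] == 6, "name"]

print(six_cyl)
print(six_cyl_names)
```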

In [56]:
# 4.9.1


In [57]:
# 4.9.2


## 5. Debugging in PyCharm

<font size='4'>

https://www.jetbrains.com/help/pycharm/debugging-your-first-python-application.html#summary 