In [None]:
%%html
<style>
h1, h2, h3, h4, h5 {
    color: darkblue;
    font-weight: bold !important;
}
h2 {
    border-bottom: 8px solid darkblue !important;
    padding-bottom: 8px;
}
h3 {
    border-bottom: 2px solid darkblue !important;
    padding-bottom: 6px;
}
.info, .success, .warning, .error {
    border: 1px solid;
    margin: 10px 0px;
    padding:15px 10px;
}
.info {
    color: #00529b;
    background-color: #bde5f8;
}
.success {
    color: #4f8a10;
    background-color: #dff2bf;
}
.warning {
    color: #9f6000;
    background-color: #FEEFB3;
}
.error {
    color: #D8000C;
    background-color: #FFBABA;
}
.language-bash {
    font-weight: 900;
}
.ex {
    font-weight: 900;
    color: rgba(27,27,255,0.87) !important;
}
.mn {
    font-family: Menlo, Consolas, "DejaVu Sans Mono", monospace
}
table {
    margin-left: 0 !important;}
</style>

# Day 2: Up and Running with Python

## 2.7 Pandas

### <span class='mn'>pandas.read_csv()</span>

-   Pandas provides a rich set of functions to access data from a variety of sources, such as CSV files, SQL databases, and Excel workbooks.


-   `pandas.read_csv()` parses CSV files in a variety of formats into memory as a `DataFrame`.


-   `pandas.read_csv()` can read either a local CSV file or one served from a given URL.
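
As a small illustration of the "different formats" point, the sketch below reads semicolon-delimited text with the `sep` parameter. The inline data and the name `sample` are made up for this example and are not part of `SalesJan2009.csv`.

```python
import io

import pandas as pd

# Hypothetical inline data standing in for a semicolon-delimited file.
raw = io.StringIO("Product;Price\nProduct1;1200\nProduct2;3600\n")

# `sep` tells read_csv which character separates the fields.
sample = pd.read_csv(raw, sep=';')

print(sample.shape)
print(list(sample.columns))
```

A file path or a URL can be passed in place of the `io.StringIO` object.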

Let's try to load the same CSV file from two sources:
1.  https://raw.githubusercontent.com/yoonghm/nawp/master/SalesJan2009.csv
2.  A local copy of `SalesJan2009.csv`

In [None]:
!dir SalesJan2009.csv

#### Simple <span class='mn'>pandas.read_csv()</span>

In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv('SalesJan2009.csv') # Read from local file
# Read the online URL as df1 in memory
df1 = pd.read_csv('https://raw.githubusercontent.com/yoonghm/nawp/master/SalesJan2009.csv')

In [None]:
df.head(3)

In [None]:
df1.head(3)  # See the top three rows

In [None]:
df1.tail(2)  # See the bottom two rows

#### Parse columns with the correct datatypes with <span class='mn'>pandas.read_csv()</span>

In [None]:
df1.info()  # See how Pandas parses the data

Notice that the following columns were not parsed as datetime objects:
-   `Transaction_date` (column index 0)
-   `Account_Created` (column index 8)
-   `Last_Login` (column index 9)

and the following column was not parsed as integers:
-   `Price` (column index 2)

There is a missing value for `State`.

Let's use the `parse_dates` parameter of `pandas.read_csv()` to parse the datetime columns.

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime

df1 = pd.read_csv(
    'https://raw.githubusercontent.com/yoonghm/nawp/master/SalesJan2009.csv',
    parse_dates=[0, 8, 9]
)

In [None]:
df1.info()

In [None]:
df1.head()
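
To see the effect of `parse_dates` in isolation, here is a minimal sketch on a two-row inline sample. The data is invented for illustration and uses ISO-formatted timestamps, which may differ from the format in the real file.

```python
import io

import pandas as pd

# Hypothetical sample; the timestamps are ISO-formatted for simplicity.
raw = io.StringIO(
    "Transaction_date,Product,Price\n"
    "2009-01-02 06:17:00,Product1,1200\n"
    "2009-01-02 04:53:00,Product1,1200\n"
)

plain = pd.read_csv(raw)                    # dates stay as text
raw.seek(0)                                 # rewind the buffer to read it again
parsed = pd.read_csv(raw, parse_dates=[0])  # column 0 becomes datetime

print(plain['Transaction_date'].dtype)
print(parsed['Transaction_date'].dtype)

# Datetime columns expose the .dt accessor, for example the hour of each row.
print(parsed['Transaction_date'].dt.hour.tolist())
```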

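Let's convert the `Price` column to integers. First, look at its unique values to find the entry that prevented pandas from parsing it as a number.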

In [None]:
df1['Price'].unique()

In [None]:
df1['Price'] == '13,000'

In [None]:
df1[df1['Price'] == '13,000']

In [None]:
df1.loc[558,'Price'] = 13000

In [None]:
df1[558:559]

In [None]:
df1['Price'] = df1['Price'].astype(int)

In [None]:
df1.info()

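We could also remove the `','` thousands separator from every value and then convert the type. Casting to `str` first keeps this cell working even if the column was already converted to integers above.

In [None]:
df1['Price'] = df1['Price'].astype(str).str.replace(',', '').astype(int)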

In [None]:
df1.to_csv('backup.csv')

In [None]:
df1.info()
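
A further option, not used in the cells above, is to let `pandas.read_csv()` handle the separator at load time through its `thousands` parameter. The sketch below uses a made-up two-row sample rather than the real file.

```python
import io

import pandas as pd

# Hypothetical sample: the second price is quoted because it contains a comma.
raw = io.StringIO(
    'Product,Price\n'
    'Product1,1200\n'
    'Product2,"13,000"\n'
)

# thousands=',' strips the separator while the numbers are being parsed.
sample = pd.read_csv(raw, thousands=',')

print(sample['Price'].tolist())
print(sample['Price'].dtype)
```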

Find out more about the missing value in the `State` column.

In [None]:
df1['State'].isnull()

In [None]:
df1[df1['State'].isnull()]  # List records from index with missing data in `State` column
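
The same check can be summarised per column, and the gap can be filled if that suits the analysis. This is a minimal sketch on invented data; the placeholder `'Unknown'` is an assumption, not something the dataset prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing State.
sample = pd.DataFrame({
    'City': ['Basildon', 'Parkville', 'Astoria'],
    'State': ['England', np.nan, 'OR'],
})

# Count the missing values in every column.
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# fillna returns a new DataFrame; `sample` itself is left unchanged.
filled = sample.fillna({'State': 'Unknown'})
print(filled)
```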

### Statistics from Numerical Columns

-   We could use `pandas.DataFrame.describe()` to compute summary statistics for the numerical columns.

In [None]:
df1.describe()
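
By default `describe()` only reports on numerical columns. The sketch below, on a small invented frame, shows that behaviour and the `include='all'` option that also summarises text columns.

```python
import pandas as pd

# Hypothetical sample with one numerical and one text column.
sample = pd.DataFrame({
    'Price': [1200, 3600, 7500],
    'Country': ['United States', 'France', 'United States'],
})

numeric = sample.describe()                  # numerical columns only
everything = sample.describe(include='all')  # text columns as well

print(numeric)
print(everything)
```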

### Display Rows using Index

In [None]:
df1[2:8]  # Show contiguous rows at positions 2 to 7

In [None]:
df1.loc[[2,4,6,8]]  # Show non-contiguous rows by label using DataFrame.loc

In [None]:
df1.loc[[1,3,5,7],['City','State','Country']]  # Show non-contiguous rows and particular columns

In [None]:
df1.iloc[:] # Show all rows and columns

In [None]:
df1.iloc[1:4, 0:2] # Show the 2nd to 4th rows of the 1st and 2nd columns

In [None]:
df1.tail(2)  # Show the last 2 rows from df1
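
With the default index, `.loc` and `.iloc` can look interchangeable. The sketch below uses an invented frame with a non-default index to show the difference: `.loc` slices by label and includes the end label, while `.iloc` slices by position and excludes the end position.

```python
import pandas as pd

# Hypothetical sample whose index labels differ from the row positions.
sample = pd.DataFrame(
    {'City': ['A', 'B', 'C', 'D']},
    index=[10, 20, 30, 40],
)

by_label = sample.loc[10:30]     # labels 10, 20 and 30 (end label included)
by_position = sample.iloc[0:2]   # positions 0 and 1 (end position excluded)

print(by_label)
print(by_position)
```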

### Display Rows Sorted by Columns

In [None]:
# Return a new DataFrame with rows sorted by Payment_Type,
# then by Country, both in ascending order
df1.sort_values(by=['Payment_Type', 'Country'])

### Display Rows Sorted by Columns in Different Orders

In [None]:
# Return a new df with sorted rows by Payment_Type in descending order,
# followed by Country in ascending order
df1.sort_values(by=['Payment_Type', 'Country'], ascending=[False, True])
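
To make the mixed ordering easy to check by eye, here is the same call on a four-row invented sample. It also illustrates that `sort_values()` returns a new DataFrame and leaves the original order untouched.

```python
import pandas as pd

# Hypothetical sample; the values are chosen only to make the ordering visible.
sample = pd.DataFrame({
    'Payment_Type': ['Visa', 'Amex', 'Visa', 'Amex'],
    'Country': ['United States', 'France', 'Canada', 'Australia'],
})

# Payment_Type descending, then Country ascending within each payment type.
ordered = sample.sort_values(
    by=['Payment_Type', 'Country'],
    ascending=[False, True],
)

print(ordered)
print(sample)  # the original frame keeps its row order
```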

### Display Top 5 Rows Sorted by Columns

In [None]:
# Return the first 5 rows after sorted by Payment_Type,
# followed by Country
df1.sort_values(by=['Payment_Type', 'Country'])[0:5]

### Display the Top 5 Highest Transactions with Payment Type

In [None]:
# Return the Price and Payment_Type of the 5 highest-priced transactions
df1.sort_values(by=['Price'], ascending=[False])[0:5][['Price','Payment_Type']]

### Display the Top 5 Highest Transactions in the United States

In [None]:
# Return the top 5 highest transactions in the United States
df1[df1.Country == 'United States'].sort_values(by=['Price'], ascending=False)[0:5]
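
An alternative to sorting and slicing is `DataFrame.nlargest()`, which is not used in the cell above. The sketch below compares both approaches on an invented sample, taking the top 2 rows instead of the top 5 to keep it short.

```python
import pandas as pd

# Hypothetical sample of transactions.
sample = pd.DataFrame({
    'Country': ['United States', 'United States', 'France', 'United States'],
    'Price': [1200, 7500, 3600, 3600],
})

# Keep only the rows for one country.
us = sample[sample['Country'] == 'United States']

# Approach used in this notebook: sort in descending order, then take the top rows.
top2_sorted = us.sort_values(by='Price', ascending=False).head(2)

# Alternative: ask directly for the rows with the largest prices.
top2_nlargest = us.nlargest(2, 'Price')

print(top2_sorted)
print(top2_nlargest)
```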

### Loop Through the Top 5 Highest Transactions in the United States

In [None]:
top5_df = df1[df1.Country == 'United States'].sort_values(by=['Price'], ascending=False)[0:5]
for d in top5_df.itertuples():
    print(f'{d.Name:<30} {d.City:<15} {d.Country:<20}')

In [None]:
df2 = df1.sort_values(by=['Price'], ascending=[False])[0:5]

for d in df2.itertuples():  # To loop through every row
    print(f'{d.Country:<30} {d.Price}')
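
For reference, each item produced by `itertuples()` is a named tuple whose fields are `Index` followed by the column names. The sketch below uses an invented two-row sample and collects the formatted lines in a list so they can be inspected.

```python
import pandas as pd

# Hypothetical sample of transactions.
sample = pd.DataFrame({
    'Country': ['United States', 'France'],
    'Price': [7500, 3600],
})

lines = []
for row in sample.itertuples():
    # row.Index is the index label; the columns are available as attributes.
    lines.append(f'{row.Index}: {row.Country:<30} {row.Price}')

for line in lines:
    print(line)
```

Passing `index=False` to `itertuples()` drops the `Index` field when it is not needed.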