In [1]:
%%html
<style>
h1, h2, h3, h4, h5 {
    color: darkblue;
    font-weight: bold !important;
}
h2 {
    border-bottom: 8px solid darkblue !important;
    padding-bottom: 8px;
}
h3 {
    border-bottom: 2px solid darkblue !important;
    padding-bottom: 6px;
}
.info, .success, .warning, .error {
    border: 1px solid;
    margin: 10px 0px;
    padding:15px 10px;
}
.info {
    color: #00529b;
    background-color: #bde5f8;
}
.success {
    color: #4f8a10;
    background-color: #dff2bf;
}
.warning {
    color: #9f6000;
    background-color: #FEEFB3;
}
.error {
    color: #D8000C;
    background-color: #FFBABA;
}
.language-bash {
    font-weight: 900;
}
.ex {
    font-weight: 900;
    color: rgba(27,27,255,0.87) !important;
}
.mn {
    font-family: Menlo, Consolas, "DejaVu Sans Mono", monospace
}
table {
    margin-left: 0 !important;}
</style>

# Day 2: Up and Running with Python

## 2.7 Pandas

### <span class='mn'>pandas.read_csv()</span>

-   Pandas provides a rich set of functions to access data from a variety of sources, such as CSV files, SQL databases, and Excel files.


-   `pandas.read_csv()` allows us to parse CSV files in different formats into memory.


-   `pandas.read_csv()` can parse either a local copy of a CSV file or a CSV file at a given URL.

Let's try to load the same CSV file from two sources:
1.  https://raw.githubusercontent.com/yoonghm/nawp/master/SalesJan2009.csv
2.  A local copy of `SalesJan2009.csv`
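As a minimal sketch of the point above, `pandas.read_csv()` accepts a file path, a URL, or a file-like object. To avoid depending on the network, the example writes two sample rows (a subset of the dataset's columns) to a temporary file; the file name `sample_sales.csv` is made up for illustration.

```python
import io
import os
import tempfile

import pandas as pd

# Two illustrative rows in the same layout as SalesJan2009.csv (subset of columns)
csv_text = (
    "Transaction_date,Product,Price,Payment_Type,Country\n"
    "1/2/09 6:17,Product1,1200,Mastercard,United Kingdom\n"
    "1/2/09 4:53,Product1,1200,Visa,United States\n"
)

# 1. Read from a file-like object held in memory
df_from_buffer = pd.read_csv(io.StringIO(csv_text))

# 2. Read from a local file path (hypothetical file name in a temp directory)
with tempfile.TemporaryDirectory() as tmp_dir:
    local_path = os.path.join(tmp_dir, "sample_sales.csv")
    with open(local_path, "w", encoding="utf-8") as f:
        f.write(csv_text)
    df_from_path = pd.read_csv(local_path)

# Both sources should yield the same DataFrame
print(df_from_buffer.equals(df_from_path))
```

A URL string is passed in the same way as the local path.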

In [None]:
!dir SalesJan2009.csv

#### Simple <span class='mn'>pandas.read_csv()</span>

In [27]:
import pandas as pd
import numpy as np

# Read the online URL as df1 in memory
df1 = pd.read_csv('https://raw.githubusercontent.com/yoonghm/nawp/master/SalesJan2009.csv')

In [38]:
df1.head()  # See the top five rows

Unnamed: 0,Transaction_date,Product,Price,Payment_Type,Name,City,State,Country,Account_Created,Last_Login,Latitude,Longitude
0,1/2/09 6:17,Product1,1200,Mastercard,carolina,Basildon,England,United Kingdom,1/2/09 6:00,1/2/09 6:08,51.5,-1.116667
1,1/2/09 4:53,Product1,1200,Visa,Betina,Parkville,MO,United States,1/2/09 4:42,1/2/09 7:49,39.195,-94.68194
2,1/2/09 13:08,Product1,1200,Mastercard,Federica e Andrea,Astoria,OR,United States,1/1/09 16:21,1/3/09 12:32,46.18806,-123.83
3,1/3/09 14:44,Product1,1200,Visa,Gouya,Echuca,Victoria,Australia,9/25/05 21:13,1/3/09 14:22,-36.133333,144.75
4,1/4/09 12:56,Product2,3600,Visa,Gerd W,Cahaba Heights,AL,United States,11/15/08 15:47,1/4/09 12:45,33.52056,-86.8025


In [39]:
df1.tail()  # See the bottom five rows

Unnamed: 0,Transaction_date,Product,Price,Payment_Type,Name,City,State,Country,Account_Created,Last_Login,Latitude,Longitude
993,1/22/09 14:25,Product1,1200,Visa,Hans-Joerg,Belfast,Northern Ireland,United Kingdom,11/10/08 12:15,3/1/09 3:37,54.583333,-5.933333
994,1/28/09 5:36,Product2,3600,Visa,Christiane,Black River,Black River,Mauritius,1/9/09 8:10,3/1/09 4:40,-20.360278,57.366111
995,1/1/09 4:24,Product3,7500,Amex,Pamela,Skaneateles,NY,United States,12/28/08 17:28,3/1/09 7:21,42.94694,-76.42944
996,1/8/09 11:55,Product1,1200,Diners,julie,Haverhill,England,United Kingdom,11/29/06 13:31,3/1/09 7:28,52.083333,0.433333
997,1/12/09 21:30,Product1,1200,Visa,Julia,Madison,WI,United States,11/17/08 22:24,3/1/09 10:14,43.07306,-89.40111


#### Parse Columns with the Correct Data Types Using <span class='mn'>pandas.read_csv()</span>

In [10]:
df1.info()  # See how Pandas parses the data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 998 entries, 0 to 997
Data columns (total 12 columns):
Transaction_date    998 non-null object
Product             998 non-null object
Price               998 non-null object
Payment_Type        998 non-null object
Name                998 non-null object
City                998 non-null object
State               997 non-null object
Country             998 non-null object
Account_Created     998 non-null object
Last_Login          998 non-null object
Latitude            998 non-null float64
Longitude           998 non-null float64
dtypes: float64(2), object(10)
memory usage: 93.7+ KB


The following columns were not parsed as datetime objects:
-   `Transaction_date` (column index 0),
-   `Account_Created` (column index 8),
-   `Last_Login` (column index 9)

and the following column was not parsed as an integer:
-   `Price` (column index 2)

The `State` column also has one missing value (997 non-null out of 998).

Let's use the `parse_dates` parameter of `pandas.read_csv()` to parse the datetime columns.

In [17]:
import pandas as pd
import numpy as np
from datetime import datetime

df1 = pd.read_csv(
    'https://raw.githubusercontent.com/yoonghm/nawp/master/SalesJan2009.csv',
    parse_dates=[0, 8, 9]
)

In [19]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 998 entries, 0 to 997
Data columns (total 12 columns):
Transaction_date    998 non-null datetime64[ns]
Product             998 non-null object
Price               998 non-null object
Payment_Type        998 non-null object
Name                998 non-null object
City                998 non-null object
State               997 non-null object
Country             998 non-null object
Account_Created     998 non-null datetime64[ns]
Last_Login          998 non-null datetime64[ns]
Latitude            998 non-null float64
Longitude           998 non-null float64
dtypes: datetime64[ns](3), float64(2), object(7)
memory usage: 93.7+ KB
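`parse_dates` also accepts column names, which may be easier to read than positions and does not depend on column order. The sketch below uses a few values copied from the rows shown earlier, held in an in-memory buffer rather than the real file.

```python
import io

import pandas as pd

# Illustrative rows (subset of columns) with values copied from the dataset
csv_text = (
    "Transaction_date,Price,Account_Created,Last_Login\n"
    "1/2/09 6:17,1200,1/2/09 6:00,1/2/09 6:08\n"
    "1/3/09 14:44,1200,9/25/05 21:13,1/3/09 14:22\n"
)

# Refer to the datetime columns by name instead of by position
date_columns = ["Transaction_date", "Account_Created", "Last_Login"]
df_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=date_columns)

print(df_dates.dtypes)

# Once parsed, the .dt accessor exposes date parts such as the year
print(df_dates["Transaction_date"].dt.year)
```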


Let's convert the `Price` column to integer.

In [None]:
df1['Price'] = df1['Price'].astype(int)  # Raises ValueError because of the value '13,000'

Pandas could not convert `13,000` because of the thousands separator. Let's replace `,` with an empty string before converting the type again.

In [34]:
df1['Price'] = df1['Price'].str.replace(',', '').astype(int)

In [35]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 998 entries, 0 to 997
Data columns (total 12 columns):
Transaction_date    998 non-null object
Product             998 non-null object
Price               998 non-null int32
Payment_Type        998 non-null object
Name                998 non-null object
City                998 non-null object
State               997 non-null object
Country             998 non-null object
Account_Created     998 non-null object
Last_Login          998 non-null object
Latitude            998 non-null float64
Longitude           998 non-null float64
dtypes: float64(2), int32(1), object(9)
memory usage: 89.8+ KB
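An alternative worth knowing is the `thousands` parameter of `pandas.read_csv()`, which strips the separator while parsing. The sketch below assumes the price is stored as a quoted field (`"13,000"`); the two rows are illustrative, not read from the real file.

```python
import io

import pandas as pd

# Illustrative data: the second price is quoted and uses a thousands separator
csv_text = (
    "Product,Price\n"
    "Product1,1200\n"
    'Product1,"13,000"\n'
)

# Without `thousands`, the whole column is kept as strings
raw = pd.read_csv(io.StringIO(csv_text))

# With thousands=',', the separator is removed and the column becomes integer
clean = pd.read_csv(io.StringIO(csv_text), thousands=",")

print(raw["Price"].tolist())
print(clean["Price"].tolist())
```

This avoids the separate `str.replace()` step, at the cost of applying the rule to every numeric-looking column in the file.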


Find out more about the missing value in the `State` column.

In [32]:
df1[df1['State'].isnull()]  # List the rows whose `State` value is missing

Unnamed: 0,Transaction_date,Product,Price,Payment_Type,Name,City,State,Country,Account_Created,Last_Login,Latitude,Longitude
146,1/17/09 1:23,Product1,1200,Amex,Campbell,Mushrif,,Kuwait,1/16/09 14:26,1/18/09 9:08,29.289167,48.05
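To check every column at once, `isna().sum()` counts the missing values per column. The sketch below uses two illustrative rows, one of them modeled on the Kuwait record above; the placeholder label `'Unknown'` is an arbitrary choice, not something the dataset prescribes.

```python
import io

import pandas as pd

# Illustrative rows; the second row has an empty State field
csv_text = (
    "Name,City,State,Country\n"
    "Betina,Parkville,MO,United States\n"
    "Campbell,Mushrif,,Kuwait\n"
)
df_missing = pd.read_csv(io.StringIO(csv_text))

# Count missing values in every column
missing_per_column = df_missing.isna().sum()
print(missing_per_column)

# Fill missing states with a placeholder; other columns are left untouched
df_filled = df_missing.fillna({"State": "Unknown"})
print(df_filled)
```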


### Statistics from Numerical Columns

-   We can use `pandas.DataFrame.describe()` to compute summary statistics for the numerical columns.

In [36]:
df1.describe()

Unnamed: 0,Price,Latitude,Longitude
count,998.0,998.0,998.0
mean,1633.767535,39.015705,-41.33782
std,1156.034724,19.508572,67.389479
min,250.0,-41.465,-159.48528
25%,1200.0,35.816944,-87.99167
50%,1200.0,42.320695,-73.730695
75%,1200.0,51.05,4.916667
max,13000.0,64.83778,174.766667
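When only some statistics are needed, `describe()` can be narrowed to one column with custom percentiles, or replaced by `agg()` with an explicit list of functions. A minimal sketch on three illustrative rows:

```python
import io

import pandas as pd

# Illustrative rows with values copied from the dataset
csv_text = (
    "Product,Price,Latitude\n"
    "Product1,1200,51.5\n"
    "Product1,1200,39.195\n"
    "Product2,3600,33.52056\n"
)
df_stats = pd.read_csv(io.StringIO(csv_text))

# Summarize a single column and request the 10th and 90th percentiles
price_summary = df_stats["Price"].describe(percentiles=[0.1, 0.9])
print(price_summary)

# Compute only the statistics of interest for selected columns
selected = df_stats[["Price", "Latitude"]].agg(["min", "max", "mean"])
print(selected)
```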


### Display Rows using Index

In [46]:
df1[2:8]  # Show contiguous rows from position 2 to position 7 (the end, 8, is excluded)

Unnamed: 0,Transaction_date,Product,Price,Payment_Type,Name,City,State,Country,Account_Created,Last_Login,Latitude,Longitude
2,1/2/09 13:08,Product1,1200,Mastercard,Federica e Andrea,Astoria,OR,United States,1/1/09 16:21,1/3/09 12:32,46.18806,-123.83
3,1/3/09 14:44,Product1,1200,Visa,Gouya,Echuca,Victoria,Australia,9/25/05 21:13,1/3/09 14:22,-36.133333,144.75
4,1/4/09 12:56,Product2,3600,Visa,Gerd W,Cahaba Heights,AL,United States,11/15/08 15:47,1/4/09 12:45,33.52056,-86.8025
5,1/4/09 13:19,Product1,1200,Visa,LAURENCE,Mickleton,NJ,United States,9/24/08 15:19,1/4/09 13:04,39.79,-75.23806
6,1/4/09 20:11,Product1,1200,Mastercard,Fleur,Peoria,IL,United States,1/3/09 9:38,1/4/09 19:45,40.69361,-89.58889
7,1/2/09 20:09,Product1,1200,Mastercard,adam,Martin,TN,United States,1/2/09 17:43,1/4/09 20:01,36.34333,-88.85028


In [43]:
df1.loc[[2,4,6,8]]  # Show non-contiguous rows by label using DataFrame.loc

Unnamed: 0,Transaction_date,Product,Price,Payment_Type,Name,City,State,Country,Account_Created,Last_Login,Latitude,Longitude
2,1/2/09 13:08,Product1,1200,Mastercard,Federica e Andrea,Astoria,OR,United States,1/1/09 16:21,1/3/09 12:32,46.18806,-123.83
4,1/4/09 12:56,Product2,3600,Visa,Gerd W,Cahaba Heights,AL,United States,11/15/08 15:47,1/4/09 12:45,33.52056,-86.8025
6,1/4/09 20:11,Product1,1200,Mastercard,Fleur,Peoria,IL,United States,1/3/09 9:38,1/4/09 19:45,40.69361,-89.58889
8,1/4/09 13:17,Product1,1200,Mastercard,Renee Elisabeth,Tel Aviv,Tel Aviv,Israel,1/4/09 13:03,1/4/09 22:10,32.066667,34.766667


In [45]:
df1.loc[[1,3,5,7],['City','State','Country']]  # Show non-contiguous rows and selected columns

Unnamed: 0,City,State,Country
1,Parkville,MO,United States
3,Echuca,Victoria,Australia
5,Mickleton,NJ,United States
7,Martin,TN,United States
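One detail that is easy to miss: slicing with `df[a:b]` or `.iloc[a:b]` is positional and excludes the end, while slicing with `.loc[a:b]` is label-based and includes the end. A small sketch, using city names from the rows above:

```python
import pandas as pd

# Small frame with the default RangeIndex 0..4
df_idx = pd.DataFrame(
    {"City": ["Basildon", "Parkville", "Astoria", "Echuca", "Cahaba Heights"]}
)

by_position = df_idx[1:3]    # positional slice: rows 1 and 2, end excluded
by_iloc = df_idx.iloc[1:3]   # the same selection, written explicitly
by_label = df_idx.loc[1:3]   # label slice: rows 1, 2 and 3, end included

print(len(by_position), len(by_iloc), len(by_label))
```

With the default index, positions and labels coincide, so the difference only shows up in whether the end of the slice is included.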


In [7]:
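df2 = pd.read_csv('SalesJan2009.csv')  # Load the local copy (assumed to be in the working directory)
df2[0:5]  # Show the first 5 rows from df2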

Unnamed: 0,Transaction_date,Product,Price,Payment_Type,Name,City,State,Country,Account_Created,Last_Login,Latitude,Longitude
0,1/2/09 6:17,Product1,1200,Mastercard,carolina,Basildon,England,United Kingdom,1/2/09 6:00,1/2/09 6:08,51.5,-1.116667
1,1/2/09 4:53,Product1,1200,Visa,Betina,Parkville,MO,United States,1/2/09 4:42,1/2/09 7:49,39.195,-94.68194
2,1/2/09 13:08,Product1,1200,Mastercard,Federica e Andrea,Astoria,OR,United States,1/1/09 16:21,1/3/09 12:32,46.18806,-123.83
3,1/3/09 14:44,Product1,1200,Visa,Gouya,Echuca,Victoria,Australia,9/25/05 21:13,1/3/09 14:22,-36.133333,144.75
4,1/4/09 12:56,Product2,3600,Visa,Gerd W,Cahaba Heights,AL,United States,11/15/08 15:47,1/4/09 12:45,33.52056,-86.8025


### Display Rows Sorted by Columns

In [48]:
# Return a new df with rows sorted by Payment_Type,
# followed by Country, both in ascending order
df1.sort_values(by=['Payment_Type', 'Country'])

Unnamed: 0,Transaction_date,Product,Price,Payment_Type,Name,City,State,Country,Account_Created,Last_Login,Latitude,Longitude
606,1/25/09 5:57,Product1,1200,Amex,pamela,Ayacucho,Buenos Aires,Argentina,1/24/09 9:29,2/10/09 6:38,-37.150000,-58.483333
186,1/4/09 1:38,Product1,1200,Amex,jeremy,Charlestown,New South Wales,Australia,7/15/08 21:04,1/20/09 1:08,-32.950000,151.666667
746,1/19/09 1:52,Product1,1200,Amex,Kirsten,Melbourne,Victoria,Australia,4/24/07 4:36,2/17/09 1:20,-37.816667,144.966667
123,1/7/09 7:19,Product1,1200,Amex,Jocelyn,Bruxelles,Brussels (Bruxelles),Belgium,6/30/08 9:33,1/16/09 8:54,50.833333,4.333333
780,1/19/09 6:50,Product1,1200,Amex,Nicola,Bruxelles,Brussels (Bruxelles),Belgium,7/13/08 11:29,2/18/09 2:30,50.833333,4.333333
...,...,...,...,...,...,...,...,...,...,...,...,...
963,1/22/09 12:45,Product3,7500,Visa,Frank and Christelle,Valley Center,CA,United States,1/22/09 10:25,2/27/09 10:49,33.218330,-117.033330
967,1/14/09 8:38,Product1,1200,Visa,jingyan,Bristol,RI,United States,3/12/07 15:14,2/27/09 17:13,41.676940,-71.266670
970,1/3/09 21:19,Product1,1200,Visa,Doug and Tina,Pls Vrds Est,CA,United States,12/24/07 22:59,2/27/09 21:40,33.800560,-118.389170
985,1/28/09 11:19,Product1,1200,Visa,christal,Morrison,CO,United States,6/20/04 17:16,2/28/09 17:18,39.653610,-105.190560


### Display Rows Sorted by Columns in Different Orders

In [51]:
# Return a new df with rows sorted by Payment_Type in descending order,
# followed by Country in ascending order
df1.sort_values(by=['Payment_Type', 'Country'], ascending=[False, True])

Unnamed: 0,Transaction_date,Product,Price,Payment_Type,Name,City,State,Country,Account_Created,Last_Login,Latitude,Longitude
3,1/3/09 14:44,Product1,1200,Visa,Gouya,Echuca,Victoria,Australia,9/25/05 21:13,1/3/09 14:22,-36.133333,144.750000
36,1/5/09 20:00,Product2,3600,Visa,James,Burpengary,Queensland,Australia,12/10/08 19:53,1/8/09 17:58,-27.166667,152.950000
55,1/11/09 2:04,Product1,1200,Visa,chris,Gold Coast,Queensland,Australia,1/11/09 0:33,1/11/09 2:11,-28.000000,153.433333
120,1/12/09 1:37,Product1,1200,Visa,IMAN,Brisbane,Queensland,Australia,1/12/09 1:26,1/15/09 17:54,-27.500000,153.016667
199,1/20/09 3:51,Product1,1200,Visa,Elizabeth,The Grange,Queensland,Australia,11/12/08 3:34,1/20/09 13:11,-24.816667,152.416667
...,...,...,...,...,...,...,...,...,...,...,...,...
935,1/26/09 8:00,Product1,1200,Amex,Elizabeth,Bluffton,SC,United States,1/23/09 11:12,2/26/09 7:49,32.236940,-80.860560
936,1/11/09 8:41,Product1,1200,Amex,Juliann,Winter Spgs,FL,United States,1/8/08 7:24,2/26/09 8:17,28.698610,-81.308330
942,1/5/09 11:24,Product1,1200,Amex,Angie,Rodeo,CA,United States,1/25/08 16:50,2/26/09 12:15,38.017220,-122.287500
951,1/13/09 7:09,Product2,3600,Amex,John,Chicago,IL,United States,1/8/09 13:49,2/26/09 15:14,41.850000,-87.650000
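The effect of the `ascending` list is easier to verify on a tiny frame. The sketch below uses four illustrative rows and also shows `reset_index(drop=True)`, which renumbers the sorted rows when the original labels are no longer needed.

```python
import pandas as pd

df_sort = pd.DataFrame(
    {
        "Payment_Type": ["Visa", "Amex", "Visa", "Amex"],
        "Country": ["United States", "Belgium", "Australia", "Argentina"],
    }
)

# Payment_Type descending, then Country ascending within each payment type
sorted_df = df_sort.sort_values(
    by=["Payment_Type", "Country"], ascending=[False, True]
)

# sort_values returns a new frame and keeps the original row labels;
# reset_index(drop=True) renumbers the rows from 0
renumbered = sorted_df.reset_index(drop=True)
print(renumbered)
```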


### Display Top 5 Rows Sorted by Columns

In [53]:
# Return the first 5 rows after sorting by Payment_Type,
# followed by Country (both ascending)
df1.sort_values(by=['Payment_Type', 'Country'])[0:5]

Unnamed: 0,Transaction_date,Product,Price,Payment_Type,Name,City,State,Country,Account_Created,Last_Login,Latitude,Longitude
606,1/25/09 5:57,Product1,1200,Amex,pamela,Ayacucho,Buenos Aires,Argentina,1/24/09 9:29,2/10/09 6:38,-37.15,-58.483333
186,1/4/09 1:38,Product1,1200,Amex,jeremy,Charlestown,New South Wales,Australia,7/15/08 21:04,1/20/09 1:08,-32.95,151.666667
746,1/19/09 1:52,Product1,1200,Amex,Kirsten,Melbourne,Victoria,Australia,4/24/07 4:36,2/17/09 1:20,-37.816667,144.966667
123,1/7/09 7:19,Product1,1200,Amex,Jocelyn,Bruxelles,Brussels (Bruxelles),Belgium,6/30/08 9:33,1/16/09 8:54,50.833333,4.333333
780,1/19/09 6:50,Product1,1200,Amex,Nicola,Bruxelles,Brussels (Bruxelles),Belgium,7/13/08 11:29,2/18/09 2:30,50.833333,4.333333


### Display the Top 5 Highest Transactions with Payment Type

In [63]:
# Return the 5 highest-priced transactions,
# showing only the Price and Payment_Type columns
df1.sort_values(by=['Price'], ascending=[False])[0:5][['Price','Payment_Type']]

Unnamed: 0,Price,Payment_Type
558,13000,Visa
544,7500,Visa
963,7500,Visa
434,7500,Mastercard
493,7500,Visa
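For a top-N query like this, `DataFrame.nlargest()` is an alternative to sorting and slicing. The sketch below compares the two on illustrative data; note that rows with tied prices may come back in a different order depending on the method, so only the prices are compared.

```python
import pandas as pd

df_top = pd.DataFrame(
    {
        "Price": [1200, 13000, 7500, 3600, 7500, 1200],
        "Payment_Type": ["Visa", "Visa", "Mastercard", "Amex", "Visa", "Diners"],
    }
)

# Sort-then-slice, as in the cell above
top3_sorted = df_top.sort_values(by="Price", ascending=False).head(3)

# nlargest selects the rows with the highest values directly
top3_nlargest = df_top.nlargest(3, "Price")

print(top3_sorted[["Price", "Payment_Type"]])
print(top3_nlargest[["Price", "Payment_Type"]])
```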


### Display the Top 5 Highest Transactions in the United States

In [71]:
# Return the top 5 highest transactions in the United States
df1[df1.Country == 'United States'].sort_values(by=['Price'], ascending=False)[0:5]

Unnamed: 0,Transaction_date,Product,Price,Payment_Type,Name,City,State,Country,Account_Created,Last_Login,Latitude,Longitude
558,1/28/09 18:00,Product1,13000,Visa,sandhya,Centennial,CO,United States,12/2/06 23:24,2/7/09 15:18,39.57917,-104.87639
206,1/16/09 2:41,Product3,7500,Visa,Kristyn,Kearns,UT,United States,1/15/09 18:01,1/21/09 1:02,40.66,-111.99556
963,1/22/09 12:45,Product3,7500,Visa,Frank and Christelle,Valley Center,CA,United States,1/22/09 10:25,2/27/09 10:49,33.21833,-117.03333
995,1/1/09 4:24,Product3,7500,Amex,Pamela,Skaneateles,NY,United States,12/28/08 17:28,3/1/09 7:21,42.94694,-76.42944
912,1/25/09 11:35,Product3,7500,Mastercard,Anita,Fresno,TX,United States,1/24/09 9:24,2/25/09 14:22,29.53861,-95.44722
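The filter `df1[df1.Country == 'United States']` is a boolean mask, and several masks can be combined with `&` (and) or `|` (or). A minimal sketch on illustrative rows; the price threshold of 1200 is chosen only for the example.

```python
import pandas as pd

df_filter = pd.DataFrame(
    {
        "Name": ["sandhya", "Kristyn", "Gouya", "Betina"],
        "Country": ["United States", "United States", "Australia", "United States"],
        "Price": [13000, 7500, 1200, 1200],
    }
)

# Each comparison yields a boolean Series. The parentheses are required
# because & binds more tightly than == and >.
mask = (df_filter["Country"] == "United States") & (df_filter["Price"] > 1200)
us_high_value = df_filter[mask]
print(us_high_value)
```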


### Loop Through the Top 5 Highest Transactions in the United States

In [75]:
top5_df = df1[df1.Country == 'United States'].sort_values(by=['Price'], ascending=False)[0:5]
for d in top5_df.itertuples():
    print(f'{d.Name:<30} {d.City:<15} {d.Country:<20}')

sandhya                        Centennial                   United States       
Kristyn                        Kearns                       United States       
Frank and Christelle           Valley Center                United States       
Pamela                         Skaneateles                  United States       
Anita                          Fresno                       United States       
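Each item produced by `itertuples()` is a named tuple: the row label is available as `Index`, and the remaining fields are named after the columns. A small sketch on two illustrative rows; the column widths in the format string are arbitrary.

```python
import pandas as pd

df_rows = pd.DataFrame(
    {
        "Name": ["sandhya", "Kristyn"],
        "City": ["Centennial", "Kearns"],
        "Country": ["United States", "United States"],
    }
)

lines = []
for row in df_rows.itertuples():
    # row.Index is the row label; the other attributes match the column names
    lines.append(f"{row.Index}: {row.Name:<10}|{row.City:<12}|{row.Country}")

print("\n".join(lines))
```

Passing `index=False` to `itertuples()` omits the `Index` field when the row label is not needed.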
