# And now introducing Pandas...

- Named after a cute animal?
- Open source (free) Python library.
- One of the most widely used Python packages for data analysis.
- Future-proof: also used to prepare data for AI and machine learning tasks.

# Import & Explore

Before we can analyze data, we have to learn how to:

- import different types of spreadsheet files,
- get a quick sense of what our data holds,
- explore different sections of our data in a Pandas context,
- call different rows of data.

## Import Pandas

Unlike the "Vanilla Python" we've written so far, Pandas is not built in. We have to import it, which brings all of its functionality into our IPython notebook (```.ipynb```).


In [1]:
## import pandas
import pandas as pd
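
As a quick check that the import worked, the short sketch below prints the installed Pandas version. It assumes nothing beyond Pandas being installed; the version you see will depend on your environment.

```python
import pandas as pd

# The alias means every Pandas tool is reached through the `pd.` prefix.
print(pd.__version__)
```

If this cell runs without an error, Pandas is available under the name ```pd``` for the rest of the notebook.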

## Import CSV data into our Colab notebook

Download this data:


*   <a href="https://github.com/sandeepmj/datasets/blob/main/importing/insurance.csv">Insurance by region</a>



### Read a CSV file into your notebook

#### Bring in ```insurance.csv``` into Pandas

- You must provide a ```path``` to the file you are importing.

Syntax: ```pd.read_csv("path_to_file")```

In [5]:
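## if the data file sits at the same level as the notebook, the file name alone is enough:
# pd.read_csv("insurance.csv")
## here the file sits inside the data-raw folder, so the path includes the folder
pd.read_csv("data-raw/insurance.csv")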

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


#### Folder path

Now duplicate ```insurance.csv``` and move the copy into a folder called ```data-raw```.



In [7]:
## simply provide the path to read the csv file when data file is in a different folder
pd.read_csv("data-raw/insurance.csv")


Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500
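
The folder-path idea can be sketched end to end without downloading anything. The example below is only an illustration: it builds a temporary, hypothetical project folder, writes a three-row excerpt of the insurance data shown above into ```data-raw/insurance.csv```, and reads it back. The temporary folder and the variable names are assumptions made for the demo.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Hypothetical project folder, created only for this demo.
project = Path(tempfile.mkdtemp())
data_dir = project / "data-raw"
data_dir.mkdir()

# A three-row excerpt of the insurance data shown above.
csv_text = (
    "age,sex,bmi,children,smoker,region,charges\n"
    "19,female,27.9,0,yes,southwest,16884.924\n"
    "18,male,33.77,1,no,southeast,1725.5523\n"
    "28,male,33.0,3,no,southeast,4449.462\n"
)
csv_path = data_dir / "insurance.csv"
csv_path.write_text(csv_text)

# The path passed to read_csv includes the folder, just like "data-raw/insurance.csv".
sample_df = pd.read_csv(csv_path)
print(sample_df.shape)
```

Building the path with ```pathlib``` is a design choice, not a requirement: a plain string such as ```"data-raw/insurance.csv"``` works the same way.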


### Read and store CSV content in a ```dataframe```

The naming convention is to call it ```df```.

In [9]:
## read and store csv in a dataframe
df = pd.read_csv("data-raw/insurance.csv")

## Explore our dataframe

In [11]:
## call the df
## returns the first 5 and last 5 rows
df

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


In [13]:
## type df
type(df)


pandas.core.frame.DataFrame

In [15]:
## call the top only
## returns first 5
df.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


In [19]:
## call the top n rows
df.head(3)

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462


In [21]:
## call the last 5
df.tail()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
1333,50,male,30.97,3,no,northwest,10600.5483
1334,18,female,31.92,0,no,northeast,2205.9808
1335,18,female,36.85,0,no,southeast,1629.8335
1336,21,female,25.8,0,no,southwest,2007.945
1337,61,female,29.07,0,yes,northwest,29141.3603


In [23]:
## call the last n
df.tail(7)

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
1331,23,female,33.4,0,no,southwest,10795.93733
1332,52,female,44.7,3,no,southwest,11411.685
1333,50,male,30.97,3,no,northwest,10600.5483
1334,18,female,31.92,0,no,northeast,2205.9808
1335,18,female,36.85,0,no,southeast,1629.8335
1336,21,female,25.8,0,no,southwest,2007.945
1337,61,female,29.07,0,yes,northwest,29141.3603
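
To see how ```head()``` and ```tail()``` count rows without loading a file, here is a minimal sketch on a made-up 10-row dataframe. The dataframe and its ```row_id``` column are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical 10-row dataframe used only to show row counts.
demo = pd.DataFrame({"row_id": range(10)})

top_three = demo.head(3)      # first 3 rows: row_id 0, 1, 2
last_two = demo.tail(2)       # last 2 rows: row_id 8, 9
everything = demo.head(50)    # asking for more rows than exist returns all 10

print(len(top_three), len(last_two), len(everything))
```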


In [31]:
## call a random sample of n rows
df.sample(10)

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
134,20,female,28.785,0,no,northeast,2457.21115
435,60,male,33.11,3,no,southeast,13919.8229
970,50,female,28.16,3,no,southeast,10702.6424
393,49,male,31.35,1,no,northeast,9290.1395
1133,52,female,18.335,0,no,northwest,9991.03765
1185,45,male,23.56,2,no,northeast,8603.8234
806,40,female,41.42,1,no,northwest,28476.73499
1237,58,female,28.215,0,no,northwest,12224.35085
56,58,female,31.825,2,no,northeast,13607.36875
662,32,female,31.54,1,no,northeast,5148.5526
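
Because ```sample()``` picks rows at random, rerunning the cell normally returns different rows each time. If you need the same rows on every run (for example, to share a notebook), one option is the ```random_state``` parameter. The sketch below uses a made-up dataframe, and the seed value 42 is an arbitrary choice.

```python
import pandas as pd

# Hypothetical 100-row dataframe used only for this illustration.
demo = pd.DataFrame({"row_id": range(100)})

# The same random_state gives the same "random" rows on every run.
first_draw = demo.sample(5, random_state=42)
second_draw = demo.sample(5, random_state=42)

print(first_draw.index.equals(second_draw.index))
```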


In [33]:
## get basic overview of the data
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


### Dataframe objects
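As in "Vanilla Python," floats are floats and integers are integers.

The big difference is that a column of ```string``` values is reported with the dtype ```object```, as the ```sex```, ```smoker``` and ```region``` columns show in the ```info()``` output above.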
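
A small sketch can make the dtype idea concrete. It builds a tiny dataframe from three of the insurance rows shown earlier and prints each column's dtype. Note that the exact label reported for the text column is an assumption that depends on your Pandas version: many versions report ```object```, while newer ones may report a dedicated string dtype.

```python
import pandas as pd

# Three rows taken from the insurance data shown earlier.
demo = pd.DataFrame({
    "age": [19, 18, 28],
    "bmi": [27.9, 33.77, 33.0],
    "sex": ["female", "male", "male"],
})

# One dtype per column: integer, float, and a text dtype.
print(demo.dtypes)

# Each individual value in the text column is still a regular Python string.
print(type(demo.loc[0, "sex"]))
```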

## Importing a simple Excel file

Download this <a href="https://github.com/sandeepmj/datasets/blob/main/importing/global_life.xlsx">life expectancy file, called ```global_life.xlsx```</a>, and then read it into Pandas.

In [35]:
## step 1. confirm that read returns a df
## step 2. place it into a df, call the df
df1 = pd.read_excel("data-raw/global_life.xlsx")

In [37]:
## call the top 15 values
df1.head(15)

Unnamed: 0,Country Name,Country Code
0,Aruba,ABW
1,Afghanistan,AFG
2,Angola,AGO
3,Albania,ALB
4,Andorra,AND
5,Arab World,ARB
6,United Arab Emirates,ARE
7,Argentina,ARG
8,Armenia,ARM
9,American Samoa,ASM


In [39]:
## call the last 3 values
df1.tail(3)

Unnamed: 0,Country Name,Country Code
261,South Africa,ZAF
262,Zambia,ZMB
263,Zimbabwe,ZWE


In [41]:
## show 13 random rows
df1.sample(13)

Unnamed: 0,Country Name,Country Code
228,East Asia & Pacific (IDA & IBRD countries),TEA
208,Sierra Leone,SLE
101,IDA & IBRD total,IBT
72,Fragile and conflict affected situations,FCS
50,Cayman Islands,CYM
61,East Asia & Pacific,EAS
98,Haiti,HTI
5,Arab World,ARB
10,Antigua and Barbuda,ATG
207,Solomon Islands,SLB


## Open this ```.xlsx``` file using Excel

What do you notice?

## Read a specific sheet in an Excel file

From ```global_life.xlsx```, import only the ```global life expectancy``` sheet (that is, target one specific sheet in an ```.xlsx``` file).

```pd.read_excel("path_to_file", sheet_name = "name of sheet")```

In [43]:
## read the excel file and provide the sheet_name parameter
df2 = pd.read_excel("data-raw/global_life.xlsx", sheet_name = "global life expectancy")
df2

Unnamed: 0,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,1966,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
0,ABW,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,65.662,66.074,66.444,66.787,67.113,67.435,67.762,...,75.15800,75.299000,75.441000,75.583000,75.725000,75.868000,76.010000,76.152000,,
1,AFG,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,32.446,32.962,33.471,33.971,34.463,34.948,35.430,...,61.55300,62.054000,62.525000,62.966000,63.377000,63.763000,64.130000,64.486000,,
2,AGO,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,37.524,37.811,38.113,38.430,38.760,39.102,39.454,...,56.33000,57.236000,58.054000,58.776000,59.398000,59.925000,60.379000,60.782000,,
3,ALB,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,62.283,63.301,64.190,64.914,65.463,65.850,66.110,...,76.91400,77.252000,77.554000,77.813000,78.025000,78.194000,78.333000,78.458000,,
4,AND,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
259,XKX,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,,,,,,,,...,70.14878,70.497561,70.797561,71.097561,71.346341,71.646341,71.946341,72.195122,,
260,YEM,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,29.919,30.163,30.500,30.943,31.501,32.175,32.960,...,65.76800,65.920000,66.016000,66.066000,66.085000,66.087000,66.086000,66.096000,,
261,ZAF,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,48.406,48.777,49.142,49.509,49.888,50.284,50.705,...,58.89500,60.060000,61.099000,61.968000,62.649000,63.153000,63.538000,63.857000,,
262,ZMB,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,46.687,47.084,47.446,47.772,48.068,48.351,48.643,...,57.12600,58.502000,59.746000,60.831000,61.737000,62.464000,63.043000,63.510000,,


But if you have an Excel workbook with hundreds of sheets, you won't want to type in the name of each one, and you'll want to automate the import. So we'll reference the sheet by number instead. Sheet numbers start at 0, so ```sheet_name = 1``` is the second sheet in the workbook:

```pd.read_excel("path_to_file", sheet_name = sheet_number)```

In [56]:
## read the excel file and pass the sheet's position (0 = first sheet) to sheet_name
df3 = pd.read_excel("data-raw/global_life.xlsx", sheet_name = 1)
df3

Unnamed: 0,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,1966,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
0,ABW,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,65.662,66.074,66.444,66.787,67.113,67.435,67.762,...,75.15800,75.299000,75.441000,75.583000,75.725000,75.868000,76.010000,76.152000,,
1,AFG,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,32.446,32.962,33.471,33.971,34.463,34.948,35.430,...,61.55300,62.054000,62.525000,62.966000,63.377000,63.763000,64.130000,64.486000,,
2,AGO,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,37.524,37.811,38.113,38.430,38.760,39.102,39.454,...,56.33000,57.236000,58.054000,58.776000,59.398000,59.925000,60.379000,60.782000,,
3,ALB,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,62.283,63.301,64.190,64.914,65.463,65.850,66.110,...,76.91400,77.252000,77.554000,77.813000,78.025000,78.194000,78.333000,78.458000,,
4,AND,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
259,XKX,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,,,,,,,,...,70.14878,70.497561,70.797561,71.097561,71.346341,71.646341,71.946341,72.195122,,
260,YEM,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,29.919,30.163,30.500,30.943,31.501,32.175,32.960,...,65.76800,65.920000,66.016000,66.066000,66.085000,66.087000,66.086000,66.096000,,
261,ZAF,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,48.406,48.777,49.142,49.509,49.888,50.284,50.705,...,58.89500,60.060000,61.099000,61.968000,62.649000,63.153000,63.538000,63.857000,,
262,ZMB,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,46.687,47.084,47.446,47.772,48.068,48.351,48.643,...,57.12600,58.502000,59.746000,60.831000,61.737000,62.464000,63.043000,63.510000,,
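
To find out which number belongs to which sheet, one option is to list the sheet names in order. The sketch below is a hedged illustration: ```list_sheets``` is a hypothetical helper name, and the demo part only runs if ```global_life.xlsx``` has already been downloaded into ```data-raw/```. It also assumes an Excel engine such as ```openpyxl``` is installed.

```python
from pathlib import Path

import pandas as pd


def list_sheets(path):
    """Return the workbook's sheet names in order; position 0 is the first sheet."""
    with pd.ExcelFile(path) as workbook:
        return list(workbook.sheet_names)


# Assumption: the workbook sits in data-raw/, as in the cells above.
workbook_path = Path("data-raw/global_life.xlsx")
if workbook_path.exists():
    for position, name in enumerate(list_sheets(workbook_path)):
        print(position, name)
```

Each printed position is the number you would pass as ```sheet_name``` to read that sheet.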


## Special import cases

Download this <a href="https://github.com/sandeepmj/datasets/blob/main/importing/SBA_Disaster_Loan_Data_FY18.xlsx">SBA disaster loan FY 2018</a> file.


Import the ```FY18 Business``` sheet from ```SBA_Disaster_Loan_Data_FY18.xlsx```.

In [66]:
df4 = pd.read_excel("data-raw/SBA_Disaster_Loan_Data_FY18.xlsx", sheet_name = 4)
df4

Unnamed: 0,SBA Disaster Loan Data,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Report Run Date: 3/19/2019,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14
0,,,,,,,,Report Run Time: 9:57:31AM,,,,,,,
1,Reporting Period: 10/01/2017 - 09/30/2018,,,,,,,,,,,,,,
2,Declaration Type(s): 'ALL',,,,,,,,,,,,,,
3,SBA Physical Declaration Number,SBA EIDL Declaration Number,FEMA Disaster Number,SBA Disaster Number,Damaged Property City Name,Damaged Property Zip Code,Damaged Property County/Parish Name,Damaged Property State Code,Total Verified Loss,Verified Loss Real Estate,Verified Loss Content,Total Approved Loan Amount,Approved Amount Real Estate,Approved Amount Content,Approved Amount EIDL
4,,15337,,ZZ-00013,LOUISVILLE,40299,JEFFERSON,KY,0,0,0,49400,0,0,49400
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
825,15622,15623,4382,CA-00288,SHASTA,96087,Shasta,CA,227915,198265,29650,195100,161400,29700,4000
826,15626,15627,,NE-00071,PENDER,68047,Thurston,NE,7592,7592,0,7600,7600,0,
827,15642,15643,,NM-00062,SANTA FE,87505,Santa Fe,NM,24472,22872,1600,24500,22900,1600,
828,,,,,,,,,,,,,,,


## Does anything appear strange?

## Read a specific sheet in an Excel file but **this time skip the first few rows of formatting**

From ```SBA_Disaster_Loan_Data_FY18.xlsx```, import the ```FY18 Business``` sheet. (You have to skip the first 4 rows so that the row holding the real column names becomes the header.)


In [68]:
## read the excel file, provide the sheet_name parameter AND the skiprows parameter
## once it works, store in df
df4 = pd.read_excel("data-raw/SBA_Disaster_Loan_Data_FY18.xlsx", sheet_name = 4, skiprows = 4)
df4

Unnamed: 0,SBA Physical Declaration Number,SBA EIDL Declaration Number,FEMA Disaster Number,SBA Disaster Number,Damaged Property City Name,Damaged Property Zip Code,Damaged Property County/Parish Name,Damaged Property State Code,Total Verified Loss,Verified Loss Real Estate,Verified Loss Content,Total Approved Loan Amount,Approved Amount Real Estate,Approved Amount Content,Approved Amount EIDL
0,,15337.0,,ZZ-00013,LOUISVILLE,40299.0,JEFFERSON,KY,0.0,0.0,0.0,49400.0,0.0,0.0,49400.0
1,,15337.0,,ZZ-00013,NORWOOD YOUNG AMERICA,55368.0,CARVER,MN,0.0,0.0,0.0,49700.0,0.0,0.0,49700.0
2,,15359.0,,FL-00133,ALTAMONTE SPRINGS,32714.0,SEMINOLE,FL,0.0,0.0,0.0,7000.0,0.0,0.0,7000.0
3,,15359.0,,FL-00133,APOPKA,32712.0,ORANGE,FL,0.0,0.0,0.0,25000.0,0.0,0.0,25000.0
4,,15359.0,,FL-00133,AVENTURA,33180.0,MIAMI-DADE,FL,0.0,0.0,0.0,162300.0,0.0,0.0,162300.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
821,15622,15623.0,4382.0,CA-00288,SHASTA,96087.0,Shasta,CA,227915.0,198265.0,29650.0,195100.0,161400.0,29700.0,4000.0
822,15626,15627.0,,NE-00071,PENDER,68047.0,Thurston,NE,7592.0,7592.0,0.0,7600.0,7600.0,0.0,
823,15642,15643.0,,NM-00062,SANTA FE,87505.0,Santa Fe,NM,24472.0,22872.0,1600.0,24500.0,22900.0,1600.0,
824,,,,,,,,,,,,,,,
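
The effect of ```skiprows``` can be sketched without the Excel file, because ```pd.read_csv``` accepts the same parameter. The example below is an assumption-laden miniature: it imitates the SBA sheet's layout with four lines of report formatting followed by the real header, using a few values from the output above and only three of the columns.

```python
import io

import pandas as pd

# Four lines of report formatting, then the real header row, then data.
raw = """SBA Disaster Loan Data,,
Report Run Time: 9:57:31AM,,
Reporting Period: 10/01/2017 - 09/30/2018,,
Declaration Type(s): 'ALL',,
SBA Disaster Number,Damaged Property City Name,Damaged Property State Code
ZZ-00013,LOUISVILLE,KY
FL-00133,APOPKA,FL
"""

# Without skiprows, the report title is mistaken for the header.
messy = pd.read_csv(io.StringIO(raw))

# Skipping the 4 formatting lines makes the 5th line the header.
clean = pd.read_csv(io.StringIO(raw), skiprows=4)

print(list(messy.columns))
print(list(clean.columns))
```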


## You can also read data directly from the web

```pd.read_csv("url_of_file")```

The URL must point to the raw file itself, not to the GitHub page that displays it.

In [71]:
pd.read_csv("https://raw.githubusercontent.com/sandeepmj/datasets/refs/heads/main/importing/insurance.csv")

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500
