<a href="https://colab.research.google.com/github/thousandoaks/Python4DS101/blob/master/labs/Pandas_Basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pandas Basics



## What I hope you'll get out of this lab
* The feeling that you'll "know where to start" when you see pandas-related code in lecture, or when you need to use the pandas library for an assignment.
* (You won't be a pandas expert after one hour)
* The basics of pandas operations
* Resources for further study

## Why Pandas?
Pandas is an open source Python library for data analysis. It gives Python the ability to work with spreadsheet-like data for fast data loading, manipulating, aligning, and merging, among other functions. To give Python these enhanced features, Pandas introduces two new data types to Python: Series and DataFrame. The DataFrame represents your entire spreadsheet or rectangular data, whereas the Series is a single column of the DataFrame.
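As a minimal sketch of the relationship between the two types, the snippet below builds a tiny DataFrame by hand and pulls one column out of it. The column names and values are made up for illustration and are not part of the gapminder data used later.

```python
import pandas as pd

# A tiny hand-made table (made-up names and values, for illustration only)
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "value": [1, 2, 3],
})

# Selecting one column with single brackets returns a Series
column = df["name"]

print(type(df).__name__)      # DataFrame
print(type(column).__name__)  # Series
```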

Why should you use a programming language like Python and a tool like Pandas to work with data? It boils down to automation and reproducibility. If a particular set of analyses needs to be performed on multiple data sets, a programming language can automate the analysis across those data sets. Although many spreadsheet programs have their own macro programming languages, many users do not use them. Furthermore, not all spreadsheet programs are available on all operating systems.

---
## Feel free to change the code and run it to see the results

---

## Let's import the library


In [30]:
import pandas as pd

## Let's load some data and create a `DataFrame`

In [31]:
# we use the pandas.read_csv() function to access the file "gapminder.tsv" stored in a remote location

# the remote location is: https://raw.githubusercontent.com/thousandoaks/BEMM458/master/data/

# with the argument sep='\t' we indicate that the columns are separated by tabs rather than commas.

gapminderDataFrame = pd.read_csv('https://raw.githubusercontent.com/thousandoaks/BEMM458/master/data/gapminder.tsv', sep='\t')
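To see what `sep='\t'` changes without needing a network connection, here is a small sketch that reads the same tab-separated text twice, once with and once without the argument. The text itself is made up, and `io.StringIO` is used only as a stand-in for a real file.

```python
import io
import pandas as pd

# Two made-up rows in tab-separated form (a stand-in for a real .tsv file)
tsv_text = "name\tvalue\na\t1\nb\t2\n"

# With sep='\t', each tab starts a new column
with_tabs = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Without it, pandas looks for commas, finds none, and keeps each line as one column
default_sep = pd.read_csv(io.StringIO(tsv_text))

print(with_tabs.shape)    # (2, 2)
print(default_sep.shape)  # (2, 1)
```

If a freshly loaded file shows up as a single wide column, a wrong separator is a likely cause.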

In [32]:
gapminderDataFrame

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.853030
2,Afghanistan,Asia,1962,31.997,10267083,853.100710
3,Afghanistan,Asia,1967,34.020,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,Africa,1987,62.351,9216418,706.157306
1700,Zimbabwe,Africa,1992,60.377,10704340,693.420786
1701,Zimbabwe,Africa,1997,46.809,11404948,792.449960
1702,Zimbabwe,Africa,2002,39.989,11926563,672.038623


### We observe that the previous DataFrame is a panel data set consisting of 6 columns and 1704 observations. Each observation is a row.

In [33]:
gapminderDataFrame.head()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.85303
2,Afghanistan,Asia,1962,31.997,10267083,853.10071
3,Afghanistan,Asia,1967,34.02,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106


In [34]:
# we get some more detailed info on our dataset
gapminderDataFrame.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   country    1704 non-null   object 
 1   continent  1704 non-null   object 
 2   year       1704 non-null   int64  
 3   lifeExp    1704 non-null   float64
 4   pop        1704 non-null   int64  
 5   gdpPercap  1704 non-null   float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB
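Besides `info()`, the same facts can be read from individual attributes. The sketch below uses a small made-up DataFrame (not the gapminder data) to show `shape`, `columns`, and `dtypes`.

```python
import pandas as pd

# Made-up data, for illustration only
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "count": [1, 2, 3],
    "score": [0.5, 1.5, 2.5],
})

# shape is a (rows, columns) tuple
n_rows, n_cols = df.shape
print(n_rows, n_cols)      # 3 3

# columns holds the column labels; dtypes holds one data type per column
print(list(df.columns))    # ['name', 'count', 'score']
print(df.dtypes)
```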


## Let's extract some columns from our data

In [35]:
# we can extract several columns at the same time
gapminderDataFrame[['country','lifeExp']]

Unnamed: 0,country,lifeExp
0,Afghanistan,28.801
1,Afghanistan,30.332
2,Afghanistan,31.997
3,Afghanistan,34.020
4,Afghanistan,36.088
...,...,...
1699,Zimbabwe,62.351
1700,Zimbabwe,60.377
1701,Zimbabwe,46.809
1702,Zimbabwe,39.989
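The double brackets above matter: a list of column names returns a DataFrame, while a single name in single brackets returns a Series. A minimal sketch with made-up data:

```python
import pandas as pd

# Made-up data, for illustration only
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "x": [1, 2, 3],
    "y": [4.0, 5.0, 6.0],
})

as_series = df["name"]        # single brackets: one column as a Series
as_frame = df[["name"]]       # double brackets: a one-column DataFrame
two_cols = df[["name", "y"]]  # double brackets: several columns at once

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(list(two_cols.columns))    # ['name', 'y']
```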


## Let's extract some rows from our data

In [36]:
# let's extract the first row. .loc selects by index label, and the default index starts counting from zero
gapminderDataFrame.loc[0]

country      Afghanistan
continent           Asia
year                1952
lifeExp           28.801
pop              8425333
gdpPercap     779.445314
Name: 0, dtype: object

In [37]:
# let's extract the second row, whose index label is 1
gapminderDataFrame.loc[1]

country      Afghanistan
continent           Asia
year                1957
lifeExp           30.332
pop              9240934
gdpPercap      820.85303
Name: 1, dtype: object

In [38]:
# we can even select multiple rows
listofRows=[0,5,10]

gapminderDataFrame.loc[listofRows]

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
5,Afghanistan,Asia,1977,38.438,14880372,786.11336
10,Afghanistan,Asia,2002,42.129,25268405,726.734055


In [39]:
# we can even select multiple rows by range (with .loc, the end label 8 is included)

gapminderDataFrame.loc[0:8]

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.85303
2,Afghanistan,Asia,1962,31.997,10267083,853.10071
3,Afghanistan,Asia,1967,34.02,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106
5,Afghanistan,Asia,1977,38.438,14880372,786.11336
6,Afghanistan,Asia,1982,39.854,12881816,978.011439
7,Afghanistan,Asia,1987,40.822,13867957,852.395945
8,Afghanistan,Asia,1992,41.674,16317921,649.341395
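`.loc` selects rows by index label, which happens to match the row position here because the default index runs 0, 1, 2, and so on. The sketch below uses a made-up DataFrame with a non-default index to contrast `.loc` with the position-based `.iloc`.

```python
import pandas as pd

# Made-up data with index labels that do not start at zero
df = pd.DataFrame({"value": [10, 20, 30, 40]}, index=[100, 101, 102, 103])

by_label = df.loc[100:102]   # labels 100, 101, 102: the end label is included
by_position = df.iloc[0:2]   # positions 0 and 1: the end position is excluded

print(len(by_label))         # 3
print(len(by_position))      # 2
print(df.iloc[0]["value"])   # 10 (first row by position; df.loc[0] would raise a KeyError here)
```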


## Let's perform some operations on columns

In [40]:
gapminderDataFrame['pop']/1000000

0        8.425333
1        9.240934
2       10.267083
3       11.537966
4       13.079460
          ...    
1699     9.216418
1700    10.704340
1701    11.404948
1702    11.926563
1703    12.311143
Name: pop, Length: 1704, dtype: float64

In [41]:
gapminderDataFrame['pop_Millions']=gapminderDataFrame['pop']/1000000

In [42]:
gapminderDataFrame[['pop','pop_Millions']]

Unnamed: 0,pop,pop_Millions
0,8425333,8.425333
1,9240934,9.240934
2,10267083,10.267083
3,11537966,11.537966
4,13079460,13.079460
...,...,...
1699,9216418,9.216418
1700,10704340,10.704340
1701,11404948,11.404948
1702,11926563,11.926563


In [43]:
gapminderDataFrame['country']

0       Afghanistan
1       Afghanistan
2       Afghanistan
3       Afghanistan
4       Afghanistan
           ...     
1699       Zimbabwe
1700       Zimbabwe
1701       Zimbabwe
1702       Zimbabwe
1703       Zimbabwe
Name: country, Length: 1704, dtype: object

In [44]:
gapminderDataFrame['country'].str.upper()

0       AFGHANISTAN
1       AFGHANISTAN
2       AFGHANISTAN
3       AFGHANISTAN
4       AFGHANISTAN
           ...     
1699       ZIMBABWE
1700       ZIMBABWE
1701       ZIMBABWE
1702       ZIMBABWE
1703       ZIMBABWE
Name: country, Length: 1704, dtype: object
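`.str` exposes more string methods than `upper()`, each applied to every element of the column. A short sketch on a made-up Series:

```python
import pandas as pd

# Made-up strings, for illustration only
s = pd.Series(["Alpha", "beta"])

upper = s.str.upper()               # every element in upper case
lengths = s.str.len()               # number of characters per element
starts_a = s.str.startswith("A")    # True/False per element

print(upper.tolist())      # ['ALPHA', 'BETA']
print(lengths.tolist())    # [5, 4]
print(starts_a.tolist())   # [True, False]
```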

In [45]:
gapminderDataFrame['COUNTRY']=gapminderDataFrame['country'].str.upper()

In [46]:
gapminderDataFrame[['country','COUNTRY']]

Unnamed: 0,country,COUNTRY
0,Afghanistan,AFGHANISTAN
1,Afghanistan,AFGHANISTAN
2,Afghanistan,AFGHANISTAN
3,Afghanistan,AFGHANISTAN
4,Afghanistan,AFGHANISTAN
...,...,...
1699,Zimbabwe,ZIMBABWE
1700,Zimbabwe,ZIMBABWE
1701,Zimbabwe,ZIMBABWE
1702,Zimbabwe,ZIMBABWE


In [47]:
gapminderDataFrame['gdpPercap']*gapminderDataFrame['pop']

0       6.567086e+09
1       7.585449e+09
2       8.758856e+09
3       9.648014e+09
4       9.678553e+09
            ...     
1699    6.508241e+09
1700    7.422612e+09
1701    9.037851e+09
1702    8.015111e+09
1703    5.782658e+09
Length: 1704, dtype: float64

In [48]:
gapminderDataFrame['gdp']=gapminderDataFrame['gdpPercap']*gapminderDataFrame['pop']
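The operations in this section work element by element: row 0 is combined with row 0, row 1 with row 1, and so on, with no explicit loop. A minimal sketch with made-up column names and values:

```python
import pandas as pd

# Made-up data, for illustration only
df = pd.DataFrame({"price": [2.0, 3.0], "quantity": [10, 20]})

# Column times column: multiplied row by row
df["total"] = df["price"] * df["quantity"]

# Column divided by a single number: the number is applied to every row
df["quantity_tens"] = df["quantity"] / 10

print(df["total"].tolist())          # [20.0, 60.0]
print(df["quantity_tens"].tolist())  # [1.0, 2.0]
```

Assigning to a column name that does not exist yet creates the column; assigning to an existing name overwrites it.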

## Let's sort the `DataFrame`

In [49]:
## Let's sort data by country in ascending order
gapminderDataFrame.sort_values(by='country',ascending=True)

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap,pop_Millions,COUNTRY,gdp
0,Afghanistan,Asia,1952,28.801,8425333,779.445314,8.425333,AFGHANISTAN,6.567086e+09
11,Afghanistan,Asia,2007,43.828,31889923,974.580338,31.889923,AFGHANISTAN,3.107929e+10
10,Afghanistan,Asia,2002,42.129,25268405,726.734055,25.268405,AFGHANISTAN,1.836341e+10
9,Afghanistan,Asia,1997,41.763,22227415,635.341351,22.227415,AFGHANISTAN,1.412200e+10
7,Afghanistan,Asia,1987,40.822,13867957,852.395945,13.867957,AFGHANISTAN,1.182099e+10
...,...,...,...,...,...,...,...,...,...
1693,Zimbabwe,Africa,1957,50.469,3646340,518.764268,3.646340,ZIMBABWE,1.891591e+09
1692,Zimbabwe,Africa,1952,48.451,3080907,406.884115,3.080907,ZIMBABWE,1.253572e+09
1702,Zimbabwe,Africa,2002,39.989,11926563,672.038623,11.926563,ZIMBABWE,8.015111e+09
1696,Zimbabwe,Africa,1972,55.635,5861135,799.362176,5.861135,ZIMBABWE,4.685170e+09


In [50]:
## Let's sort data by country in descending order
gapminderDataFrame.sort_values(by='country',ascending=False)

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap,pop_Millions,COUNTRY,gdp
1703,Zimbabwe,Africa,2007,43.487,12311143,469.709298,12.311143,ZIMBABWE,5.782658e+09
1697,Zimbabwe,Africa,1977,57.674,6642107,685.587682,6.642107,ZIMBABWE,4.553747e+09
1692,Zimbabwe,Africa,1952,48.451,3080907,406.884115,3.080907,ZIMBABWE,1.253572e+09
1693,Zimbabwe,Africa,1957,50.469,3646340,518.764268,3.646340,ZIMBABWE,1.891591e+09
1694,Zimbabwe,Africa,1962,52.358,4277736,527.272182,4.277736,ZIMBABWE,2.255531e+09
...,...,...,...,...,...,...,...,...,...
8,Afghanistan,Asia,1992,41.674,16317921,649.341395,16.317921,AFGHANISTAN,1.059590e+10
9,Afghanistan,Asia,1997,41.763,22227415,635.341351,22.227415,AFGHANISTAN,1.412200e+10
10,Afghanistan,Asia,2002,42.129,25268405,726.734055,25.268405,AFGHANISTAN,1.836341e+10
11,Afghanistan,Asia,2007,43.828,31889923,974.580338,31.889923,AFGHANISTAN,3.107929e+10


In [51]:
## Let's sort data by lifeExp in descending order
gapminderDataFrame.sort_values(by='lifeExp',ascending=False)

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap,pop_Millions,COUNTRY,gdp
803,Japan,Asia,2007,82.603,127467972,31656.068060,127.467972,JAPAN,4.035135e+12
671,"Hong Kong, China",Asia,2007,82.208,6980412,39724.978670,6.980412,"HONG KONG, CHINA",2.772967e+11
802,Japan,Asia,2002,82.000,127065841,28604.591900,127.065841,JAPAN,3.634667e+12
695,Iceland,Europe,2007,81.757,301931,36180.789190,0.301931,ICELAND,1.092410e+10
1487,Switzerland,Europe,2007,81.701,7554661,37506.419070,7.554661,SWITZERLAND,2.833483e+11
...,...,...,...,...,...,...,...,...,...
1344,Sierra Leone,Africa,1952,30.331,2143249,879.787736,2.143249,SIERRA LEONE,1.885604e+09
36,Angola,Africa,1952,30.015,4232095,3520.610273,4.232095,ANGOLA,1.489956e+10
552,Gambia,Africa,1952,30.000,284320,485.230659,0.284320,GAMBIA,1.379608e+08
0,Afghanistan,Asia,1952,28.801,8425333,779.445314,8.425333,AFGHANISTAN,6.567086e+09
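Two details of `sort_values()` are easy to miss: it returns a new DataFrame and leaves the original order untouched, and `by` accepts a list of columns so that ties in the first column are broken by the second. A sketch with made-up data:

```python
import pandas as pd

# Made-up data, for illustration only
df = pd.DataFrame({
    "group": ["b", "a", "b", "a"],
    "value": [1, 4, 3, 2],
})

# Sort by group ascending, then by value descending within each group
sorted_df = df.sort_values(by=["group", "value"], ascending=[True, False])

print(sorted_df["value"].tolist())  # [4, 2, 3, 1]
print(df["value"].tolist())         # [1, 4, 3, 2] (the original is unchanged)
```

To keep the sorted order, assign the result to a variable, as done with `sorted_df` here.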


## Additional Resources
* Ten Minutes to Pandas: https://pandas.pydata.org/docs/user_guide/10min.html

* Getting started tutorial: https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html



