 **PANDAS**



`pandas` is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals. 
— Wikipedia



**Applications of pandas:**

- data I/O
- data cleaning
- data wrangling
- data analysis
- data visualization

It works well with many other Python packages, such as:
- matplotlib, plotly
- scikit-learn (`sklearn`)
- numpy



Jupyter Notebooks offer a good environment for using pandas to do data exploration and modeling, but pandas can also be used in text editors just as easily.

In [None]:
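#install package (use one of the following)

# from a terminal, with pip:
#   pip install pandas

# from a terminal, with conda:
#   conda install pandas

# from inside a notebook cell:
!pip install pandas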


In [2]:
#import package

import pandas as pd

The two main components of pandas are the Series and the DataFrame.
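
As a rough illustration of how the two relate (the names `ages` and `people` are made up for this sketch): a Series is a single labeled column, and a DataFrame is a table whose columns are Series that share one index.

```python
import pandas as pd

# A Series is a one-dimensional labeled array
ages = pd.Series([10, 15, 14], name="Age")

# A DataFrame is a two-dimensional table; each column is a Series
people = pd.DataFrame({"Name": ["tom", "nick", "juli"], "Age": ages})

# Selecting one column gives back a Series
print(type(people["Age"]))
print(people["Age"].equals(ages))  # True: same values, same index
```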


# Creating a DataFrame:

**1. From Lists:**

In [25]:
# initialize list of lists 
data = [['tom', 10], ['nick', 15], ['juli', 14]] 
  
# Create the pandas DataFrame 
df = pd.DataFrame(data, columns = ['Name', 'Age']) 
  
# print dataframe. 
df 

Unnamed: 0,Name,Age
0,tom,10
1,nick,15
2,juli,14


**2. From a Dictionary:** 

In [15]:
# initialize dictionary 
data = {"Name": ["tom", "nick", "juli", "neha"], 
        "Age": [10, 15, 14, None]
        }
  
# Create the pandas DataFrame 
df = pd.DataFrame(data) 
  
# print dataframe. 
df

Unnamed: 0,Name,Age
0,tom,10.0
1,nick,15.0
2,juli,14.0
3,neha,
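
The `Age` column above prints as `10.0` rather than `10` because the missing value is stored as `NaN`, which is a float. A short sketch of this behaviour follows; the nullable `Int64` dtype at the end is one optional way to keep whole numbers and is not part of the original notebook.

```python
import pandas as pd

data = {"Name": ["tom", "nick", "juli", "neha"],
        "Age": [10, 15, 14, None]}
df = pd.DataFrame(data)

# None becomes NaN, which forces the whole column to float64
print(df["Age"].dtype)  # float64

# Optional: the nullable integer dtype keeps integers and marks the gap as <NA>
age_int = df["Age"].astype("Int64")
print(age_int)
```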


# Reading and writing data:

1. From/To CSV:

In [None]:
# header=0 (the default) uses the first line of the file as column names
df = pd.read_csv('filename.csv', header=0)

df.to_csv('filename.csv', index=False)

2. From/To JSON:

In [None]:
df = pd.read_json('filename.json')

# orient='records' writes one JSON object per row, without the index
df.to_json('filename.json', orient='records')

- pandas can also read from and write to Excel files, SQL databases, and other formats.
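
For the SQL case, here is a minimal sketch using Python's built-in `sqlite3` module and an in-memory database; the table name `people` is made up for the example. Excel works analogously through `pd.read_excel` and `DataFrame.to_excel`, which need an extra engine such as `openpyxl` installed.

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({"Name": ["tom", "nick", "juli"], "Age": [10, 15, 14]})

# In-memory SQLite database, so nothing is written to disk
conn = sqlite3.connect(":memory:")

# Write the DataFrame to a table (hypothetical name "people")
df.to_sql("people", conn, index=False)

# Read it back with a SQL query
df_sql = pd.read_sql("SELECT * FROM people WHERE Age > 10", conn)
print(df_sql)

conn.close()
```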

# Other simple pandas operations:

In [7]:
#view the dataframe 

df.head(2)


Unnamed: 0,Name,Age
0,tom,10
1,nick,15


In [8]:
df.tail(3)

Unnamed: 0,Name,Age
0,tom,10
1,nick,15
2,juli,14


In [9]:
df.sample(2)

Unnamed: 0,Name,Age
2,juli,14
0,tom,10


In [10]:
df.shape

(3, 2)

In [11]:
df.dtypes

Name    object
Age      int64
dtype: object

In [12]:
df.columns

Index(['Name', 'Age'], dtype='object')

In [13]:
df.index

RangeIndex(start=0, stop=3, step=1)

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Name    4 non-null      object 
 1   Age     3 non-null      float64
dtypes: float64(1), object(1)
memory usage: 192.0+ bytes


In [17]:
df.isnull()

Unnamed: 0,Name,Age
0,False,False
1,False,False
2,False,False
3,False,True


In [18]:
df.isnull().sum()

Name    0
Age     1
dtype: int64
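
There are two common ways to deal with the missing value counted above: `dropna` removes the incomplete rows, while `fillna` keeps them and replaces the gap. The sketch below shows both; filling with the column mean is only an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["tom", "nick", "juli", "neha"],
                   "Age": [10, 15, 14, None]})

# Option 1: drop every row that contains a missing value
dropped = df.dropna()

# Option 2: keep all rows and fill the gap (here with the mean age, 13.0)
filled = df.fillna({"Age": df["Age"].mean()})

print(dropped.shape)  # (3, 2)
print(filled)
```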

In [20]:
# drop rows with missing values; .copy() gives an independent DataFrame,
# so the in-place edits below do not raise SettingWithCopyWarning
df = df.dropna().copy()

df

Unnamed: 0,Name,Age
0,tom,10.0
1,nick,15.0
2,juli,14.0


In [21]:
#rename column

df.rename(columns={'Name':'name', 'Age':'age'}, inplace=True)

df.columns

Index(['name', 'age'], dtype='object')

In [None]:
df.columns = ['Name', 'Age']

df.columns

Index(['Name', 'Age'], dtype='object')

In [22]:
#add column

df['City'] = ['DC', 'NYC', 'SF']

df.head()

Unnamed: 0,name,age,City
0,tom,10.0,DC
1,nick,15.0,NYC
2,juli,14.0,SF


In [None]:
#update column
city = ['Seattle', 'Mumbai', 'London']

df["City"] = city

#add a column through list
df['Profession'] = ['student', 'chef', 'author']

df.head()

Unnamed: 0,Name,Age,City,Profession
0,tom,10,Seattle,student
1,nick,15,Mumbai,chef
2,juli,14,London,author


In [None]:
#combine string cols

df['name-city'] = df['Name'] + "-" + df['City']

df.tail()

Unnamed: 0,Name,Age,City,Profession,name-city
0,tom,10,Seattle,student,tom-Seattle
1,nick,15,Mumbai,chef,nick-Mumbai
2,juli,14,London,author,juli-London


In [26]:
#lambda and apply

#apply on a single column (a Series): calls the function on each value

df['Age-new'] = df['Age'].apply(lambda x: x+10)

df.head()

Unnamed: 0,Name,Age,Age-new
0,tom,10,20
1,nick,15,25
2,juli,14,24
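
On a single column, `apply` calls the function once per value. For simple arithmetic the same result can be written as a vectorized expression, and `DataFrame.apply` with `axis=1` passes each row instead. A sketch of the three forms (the column names `Age-vec` and `Label` are invented here):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["tom", "nick", "juli"], "Age": [10, 15, 14]})

# Series.apply: the lambda receives one value at a time
df["Age-new"] = df["Age"].apply(lambda x: x + 10)

# Vectorized equivalent: usually shorter and faster
df["Age-vec"] = df["Age"] + 10

# DataFrame.apply with axis=1: the lambda receives one row at a time
df["Label"] = df.apply(lambda row: f"{row['Name']} ({row['Age']})", axis=1)

print(df)
```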


In [27]:
!pip install gapminder

Collecting gapminder
  Downloading https://files.pythonhosted.org/packages/85/83/57293b277ac2990ea1d3d0439183da8a3466be58174f822c69b02e584863/gapminder-0.1-py3-none-any.whl
Installing collected packages: gapminder
Successfully installed gapminder-0.1


In [28]:
from gapminder import gapminder

gapminder.head()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.85303
2,Afghanistan,Asia,1962,31.997,10267083,853.10071
3,Afghanistan,Asia,1967,34.02,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106


# Slicing, Selecting and Indexing

In [29]:
gapminder['country'].head()

0    Afghanistan
1    Afghanistan
2    Afghanistan
3    Afghanistan
4    Afghanistan
Name: country, dtype: object

In [None]:
gapminder['country']

0       Afghanistan
1       Afghanistan
2       Afghanistan
3       Afghanistan
4       Afghanistan
           ...     
1699       Zimbabwe
1700       Zimbabwe
1701       Zimbabwe
1702       Zimbabwe
1703       Zimbabwe
Name: country, Length: 1704, dtype: object

In [30]:
#subset the dataframe

#columns subset

gm = gapminder[['country', 'year']]

gm.head()

Unnamed: 0,country,year
0,Afghanistan,1952
1,Afghanistan,1957
2,Afghanistan,1962
3,Afghanistan,1967
4,Afghanistan,1972


In [31]:
#subset rows

gm_afg = gapminder[gapminder['country']=="Afghanistan"]

gm_afg.head()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.85303
2,Afghanistan,Asia,1962,31.997,10267083,853.10071
3,Afghanistan,Asia,1967,34.02,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106


In [32]:
#subset by both rows and columns

gm_afg_year = gapminder[gapminder['country']=="Afghanistan"][['country','year']]

gm_afg_year

Unnamed: 0,country,year
0,Afghanistan,1952
1,Afghanistan,1957
2,Afghanistan,1962
3,Afghanistan,1967
4,Afghanistan,1972
5,Afghanistan,1977
6,Afghanistan,1982
7,Afghanistan,1987
8,Afghanistan,1992
9,Afghanistan,1997


In [None]:
#subset by both rows and columns and drop duplicates

gm_afg_cont = gapminder[gapminder['country']=="Afghanistan"][['country','continent']]

gm_afg_cont

Unnamed: 0,country,continent
0,Afghanistan,Asia
1,Afghanistan,Asia
2,Afghanistan,Asia
3,Afghanistan,Asia
4,Afghanistan,Asia
5,Afghanistan,Asia
6,Afghanistan,Asia
7,Afghanistan,Asia
8,Afghanistan,Asia
9,Afghanistan,Asia


In [None]:
gm_afg_cont.drop_duplicates()

Unnamed: 0,country,continent
0,Afghanistan,Asia
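
`drop_duplicates` compares whole rows by default. The following sketch shows its `subset` and `keep` parameters on a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan", "Albania", "Albania"],
    "continent": ["Asia", "Asia", "Europe", "Europe"],
    "year": [1952, 1957, 1952, 1957],
})

# All rows differ in "year", so nothing is dropped here
print(len(df.drop_duplicates()))  # 4

# Compare only the listed columns; the first match is kept by default
first = df.drop_duplicates(subset=["country", "continent"])

# keep="last" retains the last occurrence instead
last = df.drop_duplicates(subset=["country"], keep="last")

print(first)
print(last)
```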


# Indexing: loc and iloc

- `loc` - selects by label (row and column names); slices include the end label
- `iloc` - selects by integer position; slices exclude the end position

In [None]:
gapminder.iloc[1:5]

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
1,Afghanistan,Asia,1957,30.332,9240934,820.85303
2,Afghanistan,Asia,1962,31.997,10267083,853.10071
3,Afghanistan,Asia,1967,34.02,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106


In [None]:
gapminder.iloc[1:5, 1:4]

Unnamed: 0,continent,year,lifeExp
1,Asia,1957,30.332
2,Asia,1962,31.997
3,Asia,1967,34.02
4,Asia,1972,36.088


In [None]:
gapminder.loc[:4]

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.85303
2,Afghanistan,Asia,1962,31.997,10267083,853.10071
3,Afghanistan,Asia,1967,34.02,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106


In [None]:
gapminder.loc[:4, "year"]

0    1952
1    1957
2    1962
3    1967
4    1972
Name: year, dtype: int64

In [None]:
gapminder.loc[:4, ["year", "continent"]]

Unnamed: 0,year,continent
0,1952,Asia
1,1957,Asia
2,1962,Asia
3,1967,Asia
4,1972,Asia
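
One difference is easy to miss in the outputs above: an `iloc` slice excludes its end position, while a `loc` slice includes its end label. The distinction is clearer when the index is not the default 0, 1, 2, and so on. The sketch below uses a small made-up frame indexed by country name, with values copied from the 2007 rows of gapminder.

```python
import pandas as pd

df = pd.DataFrame(
    {"year": [2007, 2007, 2007, 2007],
     "lifeExp": [43.828, 76.423, 72.301, 42.731]},
    index=["Afghanistan", "Albania", "Algeria", "Angola"],
)

# iloc: integer positions, end excluded -> rows at positions 1 and 2
by_position = df.iloc[1:3]

# loc: index labels, end included -> Albania, Algeria and Angola
by_label = df.loc["Albania":"Angola"]

# loc can pick a row and a column by name in one step
life_exp = df.loc["Algeria", "lifeExp"]

print(by_position)
print(by_label)
print(life_exp)
```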


In [34]:
gapminder.loc[gapminder['year'] > 2002]

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
11,Afghanistan,Asia,2007,43.828,31889923,974.580338
23,Albania,Europe,2007,76.423,3600523,5937.029526
35,Algeria,Africa,2007,72.301,33333216,6223.367465
47,Angola,Africa,2007,42.731,12420476,4797.231267
59,Argentina,Americas,2007,75.320,40301927,12779.379640
...,...,...,...,...,...,...
1655,Vietnam,Asia,2007,74.249,85262356,2441.576404
1667,West Bank and Gaza,Asia,2007,73.422,4018332,3025.349798
1679,"Yemen, Rep.",Asia,2007,62.698,22211743,2280.769906
1691,Zambia,Africa,2007,42.384,11746035,1271.211593


In [None]:
gapminder.loc[gapminder['year'] > 2002, ['continent']]

Unnamed: 0,continent
11,Asia
23,Europe
35,Africa
47,Africa
59,Americas
...,...
1655,Asia
1667,Asia
1679,Asia
1691,Africa


In [None]:
gapminder.loc[(gapminder['year'] > 2002) & (gapminder['continent']=="Asia")]

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
11,Afghanistan,Asia,2007,43.828,31889923,974.580338
95,Bahrain,Asia,2007,75.635,708573,29796.04834
107,Bangladesh,Asia,2007,64.062,150448339,1391.253792
227,Cambodia,Asia,2007,59.723,14131858,1713.778686
299,China,Asia,2007,72.961,1318683096,4959.114854
671,"Hong Kong, China",Asia,2007,82.208,6980412,39724.97867
707,India,Asia,2007,64.698,1110396331,2452.210407
719,Indonesia,Asia,2007,70.65,223547000,3540.651564
731,Iran,Asia,2007,70.964,69453570,11605.71449
743,Iraq,Asia,2007,59.545,27499638,4471.061906
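
Conditions are combined element-wise with `&` (and), `|` (or) and `~` (not), and each condition needs its own parentheses. Here is a sketch on a few rows copied from the outputs above; `isin` and `query` are alternative spellings that are not used elsewhere in this notebook.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Afghanistan", "Albania", "Algeria", "Argentina"],
    "continent": ["Asia", "Europe", "Africa", "Americas"],
    "year": [2007, 2007, 2007, 2007],
    "lifeExp": [43.828, 76.423, 72.301, 75.320],
})

# and: both conditions must hold
long_lived_europe = df.loc[(df["lifeExp"] > 70) & (df["continent"] == "Europe")]

# or / not
asia_or_africa = df.loc[(df["continent"] == "Asia") | (df["continent"] == "Africa")]
not_asia = df.loc[~(df["continent"] == "Asia")]

# isin: shorthand for several "or" comparisons on one column
same_with_isin = df.loc[df["continent"].isin(["Asia", "Africa"])]

# query: the same filter written as a string expression
same_with_query = df.query("lifeExp > 70 and continent == 'Europe'")

print(asia_or_africa)
```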


# Group by, sort, value_counts

In [36]:
#group by

gapminder.groupby(['continent']).size()

continent
Africa      624
Americas    300
Asia        396
Europe      360
Oceania      24
dtype: int64

In [37]:
gapminder.groupby(["continent",'year']).size()

continent  year
Africa     1952    52
           1957    52
           1962    52
           1967    52
           1972    52
           1977    52
           1982    52
           1987    52
           1992    52
           1997    52
           2002    52
           2007    52
Americas   1952    25
           1957    25
           1962    25
           1967    25
           1972    25
           1977    25
           1982    25
           1987    25
           1992    25
           1997    25
           2002    25
           2007    25
Asia       1952    33
           1957    33
           1962    33
           1967    33
           1972    33
           1977    33
           1982    33
           1987    33
           1992    33
           1997    33
           2002    33
           2007    33
Europe     1952    30
           1957    30
           1962    30
           1967    30
           1972    30
           1977    30
           1982    30
           1987    30
           1992 

In [None]:
gapminder.groupby(["continent",'country', 'year']).size()

continent  country      year
Africa     Algeria      1952    1
                        1957    1
                        1962    1
                        1967    1
                        1972    1
                               ..
Oceania    New Zealand  1987    1
                        1992    1
                        1997    1
                        2002    1
                        2007    1
Length: 1704, dtype: int64
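
`size` only counts rows. As a sketch of two related tools, `value_counts` gives the same counts for a single column, ordered from most to least frequent, and `agg` computes summary statistics per group. The frame is a made-up miniature, not the real gapminder data.

```python
import pandas as pd

df = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Africa", "Africa", "Africa"],
    "lifeExp": [60.0, 70.0, 80.0, 50.0, 55.0, 60.0],
})

# Rows per group, ordered by group label
sizes = df.groupby("continent").size()

# Same counts, ordered from most to least frequent
counts = df["continent"].value_counts()

# Several statistics per group in one call
stats = df.groupby("continent")["lifeExp"].agg(["mean", "max"])

print(sizes)
print(counts)
print(stats)
```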

In [None]:
gapminder.columns

Index(['country', 'continent', 'year', 'lifeExp', 'pop', 'gdpPercap'], dtype='object')

In [None]:
gapminder.sort_values(by='lifeExp')

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
1292,Rwanda,Africa,1992,23.599,7290203,737.068595
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
552,Gambia,Africa,1952,30.000,284320,485.230659
36,Angola,Africa,1952,30.015,4232095,3520.610273
1344,Sierra Leone,Africa,1952,30.331,2143249,879.787736
...,...,...,...,...,...,...
1487,Switzerland,Europe,2007,81.701,7554661,37506.419070
695,Iceland,Europe,2007,81.757,301931,36180.789190
802,Japan,Asia,2002,82.000,127065841,28604.591900
671,"Hong Kong, China",Asia,2007,82.208,6980412,39724.978670


In [None]:
gapminder.sort_values(by=['lifeExp', 'pop'], ascending=[False, True])

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
803,Japan,Asia,2007,82.603,127467972,31656.068060
671,"Hong Kong, China",Asia,2007,82.208,6980412,39724.978670
802,Japan,Asia,2002,82.000,127065841,28604.591900
695,Iceland,Europe,2007,81.757,301931,36180.789190
1487,Switzerland,Europe,2007,81.701,7554661,37506.419070
...,...,...,...,...,...,...
1344,Sierra Leone,Africa,1952,30.331,2143249,879.787736
36,Angola,Africa,1952,30.015,4232095,3520.610273
552,Gambia,Africa,1952,30.000,284320,485.230659
0,Afghanistan,Asia,1952,28.801,8425333,779.445314


# Method Chaining

1. Using the gapminder dataset, find the five countries in Asia with the highest life expectancy in 2007.

In [None]:
df = gapminder[gapminder['continent']=="Asia"]

df_2007 = df[df['year']==2007]



In [38]:
gapminder.loc[(gapminder['continent']=='Asia') & (gapminder['year']==2007)].sort_values(by=['lifeExp'], ascending=False)[['country', 'lifeExp']].head(5)

Unnamed: 0,country,lifeExp
803,Japan,82.603
671,"Hong Kong, China",82.208
767,Israel,80.745
1367,Singapore,79.972
851,"Korea, Rep.",78.623


In [None]:
(gapminder
.loc[(gapminder['continent']=='Asia') & (gapminder['year']==2007)]
.sort_values(by=['lifeExp'], ascending=False)[['country']]
.head(5))

Unnamed: 0,country
803,Japan
671,"Hong Kong, China"
767,Israel
1367,Singapore
851,"Korea, Rep."
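
The same kind of chain can be written without repeating the DataFrame name inside the filter. In this sketch on a made-up miniature of the data, `query` filters with a string expression and `nlargest` replaces the sort-then-head pair. Both are alternatives, not what the cells above use.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Japan", "Israel", "Singapore", "Iceland", "Japan"],
    "continent": ["Asia", "Asia", "Asia", "Europe", "Asia"],
    "year": [2007, 2007, 2007, 2007, 2002],
    "lifeExp": [82.603, 80.745, 79.972, 81.757, 82.000],
})

top2 = (
    df
    .query("continent == 'Asia' and year == 2007")   # filter rows
    .nlargest(2, "lifeExp")                          # top rows by lifeExp
    [["country", "lifeExp"]]                         # keep two columns
)

print(top2)
```

Wrapping the chain in parentheses lets each step sit on its own line, which keeps long pipelines readable.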


# Exercises:

1. Create a dataframe using the following dictionary 'd', capitalize the column names, and add a new column that contains "xyz" in every row.

In [39]:
d = {
    "fruits": ["apple", "oranges", "pears", "peaches"],
     "seasons": ["winter", "spring", "summer", "fall"],
     "colors": ["red", "pink", "green", "blue"]
     }

In [40]:
df = pd.DataFrame(d)

In [41]:
df

Unnamed: 0,fruits,seasons,colors
0,apple,winter,red
1,oranges,spring,pink
2,pears,summer,green
3,peaches,fall,blue


In [42]:
df.columns = ['Fruits', 'Seasons', "Colors"]

df

Unnamed: 0,Fruits,Seasons,Colors
0,apple,winter,red
1,oranges,spring,pink
2,pears,summer,green
3,peaches,fall,blue


In [43]:
df['xyz'] = ['xyz', 'xyz', 'xyz', 'xyz']

df

Unnamed: 0,Fruits,Seasons,Colors,xyz
0,apple,winter,red,xyz
1,oranges,spring,pink,xyz
2,pears,summer,green,xyz
3,peaches,fall,blue,xyz


In [44]:
df['xyz2'] = df['xyz'].apply(lambda x: x.upper())

df

Unnamed: 0,Fruits,Seasons,Colors,xyz,xyz2
0,apple,winter,red,xyz,XYZ
1,oranges,spring,pink,xyz,XYZ
2,pears,summer,green,xyz,XYZ
3,peaches,fall,blue,xyz,XYZ
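
An alternative sketch of the same exercise avoids typing the new names and values by hand: `str.capitalize` renames every column at once, and assigning a single string fills the whole column. These are suggestions, not the notebook's original solution.

```python
import pandas as pd

d = {
    "fruits": ["apple", "oranges", "pears", "peaches"],
    "seasons": ["winter", "spring", "summer", "fall"],
    "colors": ["red", "pink", "green", "blue"],
}
df = pd.DataFrame(d)

# Capitalize every column name in one step
df.columns = df.columns.str.capitalize()

# A scalar is broadcast to all rows
df["xyz"] = "xyz"

# Vectorized string method instead of apply + lambda
df["xyz2"] = df["xyz"].str.upper()

print(df)
```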


2. Using the gapminder dataset, display the continents present in 2007 in decreasing (reverse alphabetical) order.

In [46]:
gapminder.loc[gapminder['year']==2007, ['continent']].drop_duplicates().sort_values(by='continent', ascending=False)

Unnamed: 0,continent
71,Oceania
23,Europe
11,Asia
59,Americas
35,Africa
