# 101 Pandas Exercises for Data Analysis

## Index
#### 71. How to remove rows from a dataframe that are present in another dataframe?
#### 72. How to get the positions where values of two columns match?
#### 73. How to create lags and leads of a column in a dataframe?
#### 74. How to get the frequency of unique values in the entire dataframe?
#### 75. How to split a text column into two separate columns?
#### 76. How do I read a tabular data file into pandas?
#### 77. How do I select a pandas Series from a DataFrame?
#### 78. Why do some pandas commands end with parentheses (and others don't)?
#### 79. How do I rename columns in a pandas DataFrame?
#### 80. How do I remove rows and columns from a pandas DataFrame?


## 71. How to remove rows from a dataframe that are present in another dataframe?

In [1]:
import pandas as pd
import numpy as np


In [2]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline


In [7]:
# From df1, remove the rows that are also present in df2. All three columns must match.
df1 = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 3,
                    'weight': ['high', 'medium', 'low'] * 3,
                    'price': np.random.randint(0, 15, 9)})
df1


Unnamed: 0,fruit,weight,price
0,apple,high,14
1,banana,medium,9
2,orange,low,2
3,apple,high,6
4,banana,medium,13
5,orange,low,3
6,apple,high,7
7,banana,medium,8
8,orange,low,12


In [20]:
df_new1 = df1.groupby(df1['fruit']).aggregate({'price': 'sum', 'weight': 'first'})
df_new1


fruit,price,weight
apple,27,high
banana,30,medium
orange,17,low


In [4]:
df2 = pd.DataFrame({'pazham': ['apple', 'orange', 'pine'] * 2,
                    'kilo': ['high', 'low'] * 3,
                    'price': np.random.randint(0, 15, 6)})
df2


Unnamed: 0,pazham,kilo,price
0,apple,high,10
1,orange,low,1
2,pine,high,5
3,apple,low,14
4,orange,high,7
5,pine,low,12


In [23]:
df_new2 = df2.groupby('pazham').aggregate({'price': 'sum', 'kilo': 'first'})
df_new2


pazham,price,kilo
apple,24,high
orange,8,low
pine,17,high


In [41]:
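# Solution
# Note: isin() with a DataFrame matches on index and column labels as well as values;
# df2 has different column names here, so no row matches and all of df1 is kept.
print(df1[~df1.isin(df2).all(axis=1)])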


    fruit  weight  price
0   apple    high     14
1  banana  medium      9
2  orange     low      2
3   apple    high      6
4  banana  medium     13
5  orange     low      3
6   apple    high      7
7  banana  medium      8
8  orange     low     12


In [43]:
# Solution 2: outer-merge the aggregated frames on their index to compare them side by side
df_new1.merge(df_new2, how='outer', left_index=True, right_index=True)


Unnamed: 0,price_x,weight,price_y,kilo
apple,27.0,high,24.0,high
banana,30.0,medium,,
orange,17.0,low,8.0,low
pine,,,17.0,high
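
When the two frames share the same column names, one common way to drop the rows of one frame that also appear in the other is an anti-join built from `merge` with `indicator=True`. The following is a minimal sketch under that assumption; the frames `left` and `right` and their values are made up for illustration and are not the notebook's `df1` and `df2`.

```python
import pandas as pd

# Hypothetical frames that share the same three column names
left = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'],
                     'weight': ['high', 'medium', 'low'],
                     'price': [14, 9, 2]})
right = pd.DataFrame({'fruit': ['apple', 'pine'],
                      'weight': ['high', 'high'],
                      'price': [14, 5]})

# Left-merge on all shared columns; '_merge' records where each row was found
merged = left.merge(right.drop_duplicates(), how='left',
                    on=['fruit', 'weight', 'price'], indicator=True)

# Keep only the rows that exist in `left` but not in `right`
remaining = merged.loc[merged['_merge'] == 'left_only', left.columns]
print(remaining)
```

Dropping duplicates from `right` before merging keeps a repeated match from multiplying rows of `left` in the result.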


## 72. How to get the positions where values of two columns match?

In [45]:
# Input
df = pd.DataFrame({'fruit1': np.random.choice(['apple', 'orange', 'banana'], 10),
                    'fruit2': np.random.choice(['apple', 'orange', 'banana'], 10)})
df


Unnamed: 0,fruit1,fruit2
0,orange,apple
1,banana,orange
2,orange,apple
3,apple,apple
4,banana,apple
5,banana,apple
6,orange,orange
7,apple,apple
8,banana,orange
9,orange,orange


In [46]:
# Solution
np.where(df.fruit1 == df.fruit2)


(array([3, 6, 7, 9], dtype=int64),)
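
`np.where` with a single argument returns a tuple of arrays, which is why the output is wrapped in parentheses. As a small sketch of two alternatives, the example below pulls out a flat array of integer positions and, separately, the matching index labels. The frame `pairs` is a made-up example, not the random `df` used above.

```python
import numpy as np
import pandas as pd

# Hypothetical data with fixed values so the result is reproducible
pairs = pd.DataFrame({'fruit1': ['orange', 'apple', 'banana', 'apple'],
                      'fruit2': ['apple', 'apple', 'banana', 'orange']})

# Boolean mask: True where the two columns hold the same value
mask = pairs['fruit1'] == pairs['fruit2']

# Integer positions of the True entries as a flat array
positions = np.flatnonzero(mask.to_numpy())

# Index labels of the matching rows (same as positions for a default index)
labels = pairs.index[mask]

print(positions)
print(list(labels))
```

Positions and labels differ once the frame has a non-default index, so it is worth choosing whichever the downstream code expects.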

## 73. How to create lags and leads of a column in a dataframe?

In [48]:
# Create two new columns in df: 'a_lag1', which shifts column 'a' down by one row (a lag),
# and 'b_lead1', which shifts column 'b' up by one row (a lead).

df = pd.DataFrame(np.random.randint(1, 100, 20).reshape(-1, 4), columns = list('abcd'))
df


Unnamed: 0,a,b,c,d
0,52,42,53,91
1,66,54,7,40
2,69,44,29,49
3,69,47,31,75
4,50,37,20,16


In [49]:
# Solution
df['a_lag1'] = df['a'].shift(1)
df['b_lead1'] = df['b'].shift(-1)
print(df)


    a   b   c   d  a_lag1  b_lead1
0  52  42  53  91     NaN     54.0
1  66  54   7  40    52.0     44.0
2  69  44  29  49    66.0     47.0
3  69  47  31  75    69.0     37.0
4  50  37  20  16    69.0      NaN
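
The shifted columns become floats because the vacated slot is filled with `NaN`. If a placeholder value is acceptable, `shift` also takes a `fill_value` argument that keeps the integer dtype. The sketch below assumes `0` is a sensible filler, which depends entirely on the data; the frame `series_demo` is illustrative only.

```python
import pandas as pd

# Hypothetical single-column frame with fixed values
series_demo = pd.DataFrame({'a': [52, 66, 69, 69, 50]})

# Lag: move values down one row and fill the first slot with 0 instead of NaN
series_demo['a_lag1'] = series_demo['a'].shift(1, fill_value=0)

# Lead: move values up one row and fill the last slot with 0 instead of NaN
series_demo['a_lead1'] = series_demo['a'].shift(-1, fill_value=0)

print(series_demo)
print(series_demo.dtypes)
```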


## 74. How to get the frequency of unique values in the entire dataframe?

In [62]:
# Input
df = pd.DataFrame(np.random.randint(1, 10, 20).reshape(-1, 4), columns = list('abcd'))
df


Unnamed: 0,a,b,c,d
0,6,3,1,1
1,4,1,1,1
2,3,6,7,7
3,6,4,9,8
4,5,9,8,4


In [63]:
# Solution
pd.Series(df.values.ravel()).value_counts()


1    5
6    3
4    3
9    2
8    2
7    2
3    2
5    1
dtype: int64
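
The same frequency table can be built with NumPy alone, which may be handy when the values are already in an array. This is a sketch of that alternative using `np.unique` with `return_counts=True`; the frame `grid` is a tiny made-up example rather than the random `df` above.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x2 frame with fixed values
grid = pd.DataFrame([[1, 1], [2, 1]], columns=list('ab'))

# np.unique flattens the array and returns the sorted unique values with their counts
values, counts = np.unique(grid.to_numpy(), return_counts=True)

# Wrap the result in a Series and sort from most to least frequent
freq = pd.Series(counts, index=values).sort_values(ascending=False)
print(freq)
```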

## 75. How to split a text column into two separate columns?

In [64]:
df = pd.DataFrame(["STD, City    State",
"33, Kolkata    West Bengal",
"44, Chennai    Tamil Nadu",
"40, Hyderabad    Telengana",
"80, Bangalore    Karnataka"], columns=['row'])

print(df)


                          row
0          STD, City    State
1  33, Kolkata    West Bengal
2   44, Chennai    Tamil Nadu
3  40, Hyderabad    Telengana
4  80, Bangalore    Karnataka


In [73]:
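# Solution
# Split on a comma or a tab; the sample rows contain no tab characters, so only the comma splits here
df_out = df.row.str.split(',|\t', expand=True)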
df_out


Unnamed: 0,0,1
0,STD,City State
1,33,Kolkata West Bengal
2,44,Chennai Tamil Nadu
3,40,Hyderabad Telengana
4,80,Bangalore Karnataka


In [74]:
# Use the first row as the header
new_header = df_out.iloc[0]
df_out = df_out[1:]
df_out.columns = new_header
print(df_out)


0 STD            City    State
1  33   Kolkata    West Bengal
2  44    Chennai    Tamil Nadu
3  40   Hyderabad    Telengana
4  80   Bangalore    Karnataka
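
If the goal were three columns (code, city, state) rather than two, the run of spaces between city and state could serve as a second delimiter. The sketch below assumes that two or more consecutive whitespace characters never occur inside a field, which holds for this sample but may not in general. The names `raw`, `parts`, and `tidy` are illustrative.

```python
import pandas as pd

# A few rows in the same layout as the exercise input
raw = pd.DataFrame({'row': ["STD, City    State",
                            "33, Kolkata    West Bengal",
                            "44, Chennai    Tamil Nadu"]})

# Split on a comma plus optional whitespace, or on a run of 2+ whitespace characters.
# Patterns longer than one character are treated as regular expressions by default.
parts = raw['row'].str.split(r',\s*|\s{2,}', expand=True)

# Promote the first row to the header and renumber the remaining rows
header = parts.iloc[0]
tidy = parts.iloc[1:].reset_index(drop=True)
tidy.columns = header.tolist()
print(tidy)
```

Single spaces inside a field, as in "West Bengal", are left alone because the pattern requires at least two whitespace characters.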


## 76. How do I read a tabular data file into pandas?

In [76]:
# read a dataset of Chipotle orders and store the results in a DataFrame
orders = pd.read_table('chipotle.tsv')
orders.head()


Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,[Clementine],$3.39
2,1,1,Nantucket Nectar,[Apple],$3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98


In [79]:
# read a dataset of movie reviewers (modifying the default parameter values for read_table)

user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']

users = pd.read_table('u.user', sep='|', header=None, names=user_cols)
users.head()


Unnamed: 0,user_id,age,gender,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213
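
The effect of `sep`, `header`, and `names` can be tried without the data files by reading from an in-memory buffer. The sketch below uses `io.StringIO` with two lines copied from the output above; passing `dtype={'zip_code': str}` is an added assumption, made so that zip codes with leading zeros would not be converted to integers.

```python
import io
import pandas as pd

# Two pipe-delimited records with no header row, held in memory instead of a file
sample = io.StringIO("1|24|M|technician|85711\n2|53|F|other|94043\n")

cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']

# sep='|' sets the delimiter, header=None says the file has no header row,
# names supplies the column labels, and dtype keeps zip codes as text
users_demo = pd.read_csv(sample, sep='|', header=None, names=cols,
                         dtype={'zip_code': str})
print(users_demo)
```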


## 77. How do I select a pandas Series from a DataFrame? 

In [80]:
# read a dataset of UFO reports into a DataFrame

ufo = pd.read_table('ufo.csv', sep=',')
ufo.head()


Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
3,Abilene,,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00


In [81]:
# read_csv is equivalent to read_table, except it assumes a comma separator

ufo = pd.read_csv('ufo.csv')
ufo.head()


Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
3,Abilene,,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00


In [82]:
# select the 'City' Series using bracket notation
ufo['City'].head()


0                  Ithaca
1             Willingboro
2                 Holyoke
3                 Abilene
4    New York Worlds Fair
Name: City, dtype: object

In [83]:
# or equivalently, use dot notation
ufo.City.head()


0                  Ithaca
1             Willingboro
2                 Holyoke
3                 Abilene
4    New York Worlds Fair
Name: City, dtype: object

In [85]:
# create a new 'Location' Series (must use bracket notation to define the Series name)
ufo['Location'] = ufo['City'] + ', ' + ufo['State']
ufo.head()


Unnamed: 0,City,Colors Reported,Shape Reported,State,Time,Location
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00,"Ithaca, NY"
1,Willingboro,,OTHER,NJ,6/30/1930 20:00,"Willingboro, NJ"
2,Holyoke,,OVAL,CO,2/15/1931 14:00,"Holyoke, CO"
3,Abilene,,DISK,KS,6/1/1931 13:00,"Abilene, KS"
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00,"New York Worlds Fair, NY"
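
Dot notation only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method, so bracket notation is the safer default. The sketch below illustrates this on a small made-up frame named `demo`, not the `ufo` data.

```python
import pandas as pd

# Hypothetical frame with one column name that contains a space
demo = pd.DataFrame({'City': ['Ithaca', 'Holyoke'],
                     'Colors Reported': ['RED', 'BLUE'],
                     'State': ['NY', 'CO']})

# Dot and bracket notation return the same Series for a simple name
same = demo['City'].equals(demo.City)

# A name with a space can only be reached with brackets
colors = demo['Colors Reported']

# New columns must be created with brackets
demo['Location'] = demo['City'] + ', ' + demo['State']

print(same)
print(demo)
```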


## 78. Why do some pandas commands end with parentheses (and others don't)?

In [87]:
# read a dataset of top-rated IMDb movies into a DataFrame

movies = pd.read_csv('imdb_1000.csv')
movies.head()


Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."
4,8.9,Pulp Fiction,R,Crime,154,"[u'John Travolta', u'Uma Thurman', u'Samuel L...."


In [88]:
# example method: calculate summary statistics
movies.describe().T


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
star_rating,979.0,7.889785,0.336069,7.4,7.6,7.8,8.1,9.3
duration,979.0,120.979571,26.21801,64.0,102.0,117.0,134.0,242.0


In [89]:
# example attribute: number of rows and columns
movies.shape


(979, 6)

In [90]:
# example attribute: data type of each column
movies.dtypes


star_rating       float64
title              object
content_rating     object
genre              object
duration            int64
actors_list        object
dtype: object

In [92]:
# use an optional parameter to the describe method to summarize only 'object' columns
movies.describe(include=['object']).T


Unnamed: 0,count,unique,top,freq
title,979,975,True Grit,2
content_rating,976,12,R,460
genre,979,16,Drama,278
actors_list,979,969,"[u'Daniel Radcliffe', u'Emma Watson', u'Rupert...",6
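
The distinction comes down to methods versus attributes: a method is a function attached to the object and must be called with parentheses, while an attribute is a stored value that is simply read. A quick way to check which is which is Python's built-in `callable`. The frame `films` below is a made-up two-row example, not the IMDb data.

```python
import pandas as pd

# Hypothetical two-row frame
films = pd.DataFrame({'star_rating': [9.3, 9.2],
                      'title': ['The Shawshank Redemption', 'The Godfather']})

# describe is a method: it is callable and needs parentheses to run
describe_is_method = callable(films.describe)

# shape is an attribute: it is a plain tuple, not something to call
shape_is_method = callable(films.shape)

print(describe_is_method, shape_is_method)
print(films.shape)
```

Leaving the parentheses off a method, as in `films.describe`, returns the method object itself rather than its result.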


## 79. How do I rename columns in a pandas DataFrame?

In [93]:
# read the UFO reports dataset again so the original column names are restored

ufo = pd.read_csv('ufo.csv')
ufo.head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
3,Abilene,,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00


In [94]:
# examine the column names
ufo.columns

Index(['City', 'Colors Reported', 'Shape Reported', 'State', 'Time'], dtype='object')

In [95]:
# rename two of the columns by using the 'rename' method
ufo.rename(columns={'Colors Reported':'Colors_Reported', 'Shape Reported':'Shape_Reported'}, inplace=True)
ufo.columns

Index(['City', 'Colors_Reported', 'Shape_Reported', 'State', 'Time'], dtype='object')

In [96]:
# replace all of the column names by overwriting the 'columns' attribute
ufo_cols = ['city', 'colors reported', 'shape reported', 'state', 'time']
ufo.columns = ufo_cols
ufo.columns

Index(['city', 'colors reported', 'shape reported', 'state', 'time'], dtype='object')

In [97]:
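# rename the columns while reading the file: header=0 discards the existing header row and 'names' supplies the new labels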
ufo_cols = ['city', 'colors reported', 'shape reported', 'state', 'time']
ufo = pd.read_csv('ufo.csv', header=0, names=ufo_cols)
ufo.head()

Unnamed: 0,city,colors reported,shape reported,state,time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
3,Abilene,,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00


In [98]:
# replace all spaces with underscores in the column names by using the 'str.replace' method
ufo.columns = ufo.columns.str.replace(' ', '_')
ufo.columns

Index(['city', 'colors_reported', 'shape_reported', 'state', 'time'], dtype='object')
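
`rename` also accepts a function, which is applied to every column label, and by default it returns a new DataFrame instead of modifying the original. The sketch below combines lowercasing and underscore replacement in one step; the frame `sightings` is a made-up stand-in for the `ufo` data.

```python
import pandas as pd

# Hypothetical frame with the same style of column names as the UFO data
sightings = pd.DataFrame({'City': ['Ithaca'],
                          'Colors Reported': ['RED'],
                          'Shape Reported': ['TRIANGLE']})

# Apply a function to each column label; without inplace=True a new frame is returned
cleaned = sightings.rename(columns=lambda name: name.lower().replace(' ', '_'))

print(list(cleaned.columns))
print(list(sightings.columns))  # the original labels are unchanged
```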

## 80. How do I remove rows and columns from a pandas DataFrame?

In [99]:
# read the UFO reports dataset again to start from the original columns
ufo = pd.read_csv('ufo.csv')
ufo.head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
3,Abilene,,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00


In [100]:
# remove a single column (axis=1 refers to columns)
ufo.drop('Colors Reported', axis=1, inplace=True)
ufo.head()

Unnamed: 0,City,Shape Reported,State,Time
0,Ithaca,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,OTHER,NJ,6/30/1930 20:00
2,Holyoke,OVAL,CO,2/15/1931 14:00
3,Abilene,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,LIGHT,NY,4/18/1933 19:00


In [101]:
# remove multiple rows at once (axis=0 refers to rows)
ufo.drop([0, 1], axis=0, inplace=True)
ufo.head()

Unnamed: 0,City,Shape Reported,State,Time
2,Holyoke,OVAL,CO,2/15/1931 14:00
3,Abilene,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,LIGHT,NY,4/18/1933 19:00
5,Valley City,DISK,ND,9/15/1934 15:30
6,Crater Lake,CIRCLE,CA,6/15/1935 0:00
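
`drop` also accepts `columns=` and `index=` keywords, which avoid having to remember what `axis=0` and `axis=1` mean. The sketch below uses them without `inplace=True`, so each call returns a new frame and the original is left untouched. The frame `reports` is a made-up example, not the `ufo` data.

```python
import pandas as pd

# Hypothetical three-row frame
reports = pd.DataFrame({'City': ['Ithaca', 'Willingboro', 'Holyoke'],
                        'Colors Reported': [None, None, None],
                        'State': ['NY', 'NJ', 'CO']})

# Remove a column by name
without_colors = reports.drop(columns=['Colors Reported'])

# Remove rows by index label
without_first_two = reports.drop(index=[0, 1])

print(without_colors)
print(without_first_two)
```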
