# Pandas
<li>Pandas is an open-source Python package, built on top of NumPy, for working with data sets.</li>
<li>The name "Pandas" is derived from "panel data" and is also a play on <b>"Python Data Analysis".</b></li>
<li>Pandas is considered to be one of the best data-wrangling packages.</li>
<li>Pandas offers user-friendly, easy-to-use data structures and analysis tools for analyzing, cleaning, exploring and manipulating data.</li>
<li>It also functions well with various other data science Python modules.</li>


# Difference Between NumPy & Pandas

![](images/pandas_vs_numpy.png)

## Why Use Pandas?

<li>Pandas is known for its exceptional ability to represent and organize data.</li>
<li>The Pandas library was created to make working with large datasets fast and efficient.</li>
<li>It excels at analyzing large amounts of data. Pandas allows us to analyze big data and draw conclusions based on statistical theory.</li>
<li>Pandas can clean messy data sets, and make them readable and relevant.</li>
<li>Pandas builds on NumPy and integrates with Matplotlib, giving users a powerful tool for <b>data analytics and visualization.</b></li>
<li>Data can be imported to Pandas from a variety of file formats, such as CSV, SQL, Excel, and JSON, among others.</li>
<li>Pandas is a versatile and marketable skill set for data analysts and data scientists that can gain the attention of employers.</li>


## Installation Of Pandas
<li>Open your terminal, activate your virtual environment, and then run the following command to install pandas.</li>

<code>
    pip install pandas
</code>

## Importing Pandas
<li>We need to import pandas before we can create a pandas dataframe and analyze it.</li>
<li>We can import the pandas package using the following command:</li>
<code>
    import pandas as pd
</code>

In [303]:
import pandas as pd

## How To Create A Pandas DataFrame
<li>A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array, arranged in a table-like structure with rows and columns.</li>
<li>We can create a basic pandas dataframe in several ways.</li>
<li>Let's discuss some of the ways to create a dataframe:</li>

![](images/dataframe.png)

### 1. From Python Dictionary

In [304]:
# Named data_dict rather than dict, which would shadow the built-in type
data_dict = {
    "name": ["Hari", "Sita", "Gita"],
    "age": [34, 22, 44]
}
pd.DataFrame(data_dict)

Unnamed: 0,name,age
0,Hari,34
1,Sita,22
2,Gita,44


### 2. From a list of dictionaries

In [305]:
list_dict = [
    {"name": "Hari", "age": 34},
    {"name": "Sita", "age": 33}
]
pd.DataFrame(list_dict)

Unnamed: 0,name,age
0,Hari,34
1,Sita,33


### 3. From a list of tuples

In [306]:
list_tuple = [("Hari", 34), ("Ram", 22)]
pd.DataFrame(list_tuple, columns=["Name", "Age"])

Unnamed: 0,Name,Age
0,Hari,34
1,Ram,22


### 4. From list of lists

In [307]:
list_list = [["Hari", 22], ["Mohan", 24]]
pd.DataFrame(list_list, columns=["Name", "Age"])

Unnamed: 0,Name,Age
0,Hari,22
1,Mohan,24


#### Question:
<li>Read the 'weather_data.csv' file using the csv reader.</li>
<li>Store the data from the CSV file in a list of lists.</li>
<li>Then create a pandas dataframe from the list of lists.</li>

In [308]:
import csv
with open("data/weather_data.csv") as file:
    data = csv.reader(file)
    list_data = list(data)

# Skip the first three rows (two junk lines and the header), keeping only the data rows
list_of = list_data[3:]
df = pd.DataFrame(list_of, columns=["Date", "Temperature", "Windspeed", "Event"])

df

Unnamed: 0,Date,Temperature,Windspeed,Event
0,1/1/2017,32,6,Rain
1,1/4/2017,not available,9,Sunny
2,1/5/2017,-1,not measured,Snow
3,1/6/2017,not available,7,no event
4,1/7/2017,32,not measured,Rain
5,1/8/2017,not available,not measured,Sunny
6,1/9/2017,not available,not measured,no event
7,1/10/2017,34,8,Cloudy
8,1/11/2017,-4,-1,Snow
9,1/12/2017,26,12,Sunny


#### Question
<li>1. Read 'imports-85.data' file using file reader.</li>
<li>2. Store the data present inside the file into a list of lists.</li>
<li>3. Create a pandas dataframe using list of lists.</li>
<li>4. For column name, we can use the columns variable given below.</li>

In [309]:
with open("data/imports-85.data") as file:
    data = csv.reader(file)
    data = list(data)

In [310]:
columns = ['symboling', 'normalized_losses', 'make', 'fuel_type', 'aspiration', 'num_of_doors',
          'body_style', 'drive_wheels', 'engine_location', 'wheel_base', 'length', 'width', 
           'height', 'curb_weight', 'engine_type', 'num_of_cylinders', 'engine_size', 'fuel_system',
          'bore', 'stroke', 'compression', 'horsepower', 'peak_rpm', 'city_mpg', 'highway_mpg', 
           'price']
import_df = pd.DataFrame(data, columns=columns)
import_df.head()

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_of_doors,body_style,drive_wheels,engine_location,wheel_base,...,engine_size,fuel_system,bore,stroke,compression,horsepower,peak_rpm,city_mpg,highway_mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


### 5. Pandas Dataframe From CSV files

<li>We can load a CSV file and create a dataframe from the data inside it using pandas.</li>
<li>We have the <b>pd.read_csv()</b> function to read a CSV file and create a pandas dataframe from the dataset.</li>

In [311]:
weather_df = pd.read_csv("data/weather_data.csv", names=["Date", "Temperature", "Windspeed", "Event"])
weather_df

Unnamed: 0,Date,Temperature,Windspeed,Event
0,kfjkdfjskd,,,
1,dfuhsdjufio,,,
2,day,temperature,windspeed,event
3,1/1/2017,32,6,Rain
4,1/4/2017,not available,9,Sunny
5,1/5/2017,-1,not measured,Snow
6,1/6/2017,not available,7,no event
7,1/7/2017,32,not measured,Rain
8,1/8/2017,not available,not measured,Sunny
9,1/9/2017,not available,not measured,no event


### Reading a csv file using skiprows and header parameters

In [312]:
weather_df = pd.read_csv("data/weather_data.csv", skiprows=3, header=None)
weather_df

Unnamed: 0,0,1,2,3
0,1/1/2017,32,6,Rain
1,1/4/2017,not available,9,Sunny
2,1/5/2017,-1,not measured,Snow
3,1/6/2017,not available,7,no event
4,1/7/2017,32,not measured,Rain
5,1/8/2017,not available,not measured,Sunny
6,1/9/2017,not available,not measured,no event
7,1/10/2017,34,8,Cloudy
8,1/11/2017,-4,-1,Snow
9,1/12/2017,26,12,Sunny


#### Reading a csv file without header and giving names to the columns

In [313]:
weather_df = pd.read_csv("data/weather_data.csv", skiprows=3, header=None, names=["Date", "Temperature", "Windspeed", "Event"])
weather_df

Unnamed: 0,Date,Temperature,Windspeed,Event
0,1/1/2017,32,6,Rain
1,1/4/2017,not available,9,Sunny
2,1/5/2017,-1,not measured,Snow
3,1/6/2017,not available,7,no event
4,1/7/2017,32,not measured,Rain
5,1/8/2017,not available,not measured,Sunny
6,1/9/2017,not available,not measured,no event
7,1/10/2017,34,8,Cloudy
8,1/11/2017,-4,-1,Snow
9,1/12/2017,26,12,Sunny


#### Read limited data from a csv file using the nrows parameter


In [314]:
weather_df = pd.read_csv("data/weather_data.csv", skiprows=2, nrows=4) # returns only 4 rows
weather_df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/4/2017,not available,9,Sunny
2,1/5/2017,-1,not measured,Snow
3,1/6/2017,not available,7,no event


#### Reading csv files with the na_values parameter ('weather_data.csv' file)


In [315]:
weather_df = pd.read_csv("data/weather_data.csv", skiprows=2, na_values=["not available", "not measured", "no event"])
weather_df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32.0,6.0,Rain
1,1/4/2017,,9.0,Sunny
2,1/5/2017,-1.0,,Snow
3,1/6/2017,,7.0,
4,1/7/2017,32.0,,Rain
5,1/8/2017,,,Sunny
6,1/9/2017,,,
7,1/10/2017,34.0,8.0,Cloudy
8,1/11/2017,-4.0,-1.0,Snow
9,1/12/2017,26.0,12.0,Sunny
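
The parameters shown so far can be combined in a single call. The sketch below assumes a small in-memory CSV (the `raw_csv` text and its values are made up for illustration, mirroring a file with two junk lines before the header) so that it runs without any file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV text: two junk lines, then a header row, then data rows.
raw_csv = """junk line one
junk line two
day,temperature,windspeed,event
1/1/2017,32,6,Rain
1/4/2017,not available,9,Sunny
1/5/2017,-1,not measured,Snow
1/6/2017,not available,7,no event
"""

sample_df = pd.read_csv(
    io.StringIO(raw_csv),
    skiprows=2,   # skip the two junk lines so the third line becomes the header
    nrows=3,      # read only the first three data rows
    na_values=["not available", "not measured", "no event"],  # treat these as missing
)

print(sample_df)
print(sample_df.dtypes)
```

Because the placeholder strings are read as missing values, the `temperature` and `windspeed` columns here come back as numeric columns rather than text.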


#### Write a pandas dataframe to a csv file
<li>We can write a pandas dataframe to a CSV file using the .to_csv() method.</li>
<li>You can give the CSV file any name when writing a pandas dataframe to it.</li>

In [316]:
# to_csv() returns None when it is given a file path, so there is nothing useful to assign
weather_df.to_csv("r_w_data/weather_nan_to.csv")
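
By default `to_csv()` also writes the row index as an unnamed first column, which comes back as a column named `Unnamed: 0` when the file is read again. A minimal sketch of the difference, assuming a small made-up dataframe and writing to an in-memory string instead of a file:

```python
import io

import pandas as pd

# Made-up data for illustration only.
demo_df = pd.DataFrame({"day": ["1/1/2017", "1/4/2017"], "temperature": [32.0, None]})

# With no path, to_csv() returns the CSV text as a string.
with_index = demo_df.to_csv()                 # index written as an extra first column
without_index = demo_df.to_csv(index=False)   # index left out

reloaded_with_index = pd.read_csv(io.StringIO(with_index))
reloaded_without_index = pd.read_csv(io.StringIO(without_index))

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```

Passing `index=False` when writing is one way to avoid having to drop the extra column after reading the file back.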

### 6. Pandas Dataframe From Excel files

<li>We can load an Excel file with the <b>.xlsx</b> extension and create a dataframe from the data inside it using pandas.</li>
<li>We have the <b>pd.read_excel()</b> function to read an Excel file and create a pandas dataframe from the dataset.</li>
<li>We also need to install <b>openpyxl</b> for working with Excel files.</li>


# nans

In [317]:
# !pip install openpyxl

In [318]:
# Importing openpyxl (engine for loading excel file)
# import openpyxl

# # opening the excel file using openpyxl.load_workbook
# wb = openpyxl.load_workbook("data/weather_data.xlsx")

# # accessing the specific sheet in the excel file
# sheet = wb["nans"]

# # iterating the rows of the file and making list of tuples
# data_to_df = [row for row in sheet.iter_rows(values_only=True)]
# excel_df = pd.DataFrame(data_to_df)
# excel_df.drop(columns=0, inplace=True)
# excel_df.columns = ["day", "temperature", "windspeed", "event"]
# excel_df.drop(labels=0, axis=0, inplace=True)
# excel_df.reset_index(drop=True, inplace=True)
# excel_df

In [319]:
# df = pd.read_excel("data/weather_data.xlsx", sheet_name="nans")
# df.drop(columns="Unnamed: 0", inplace=True)
# df

#### Writing to an excel file
<li>We can write a pandas dataframe to an Excel file using the .to_excel() method.</li>

In [320]:
# df.to_excel("r_w_data/df_to_excel.xlsx")

#### Using head() and tail() method to see top 5 and last 5 rows
<li>To view the first few rows of our dataframe, we can use the DataFrame.head() method.</li>
<li>By default, it returns the first five rows of our dataframe.</li>
<li>However, it also accepts an optional integer parameter, which specifies the number of rows.</li>

<li>Similarly, to view the last few rows of our dataframe, we can use the DataFrame.tail() method.</li>
<li>By default, it returns the last five rows of our dataframe.</li>
<li>However, it also accepts an optional integer parameter, which specifies the number of rows.</li>
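
A short sketch of both methods, with and without an explicit row count, using a made-up ten-row dataframe:

```python
import pandas as pd

# Ten rows with values 0..9, made up for illustration.
numbers_df = pd.DataFrame({"n": range(10)})

first_default = numbers_df.head()    # first five rows (the default)
first_three = numbers_df.head(3)     # first three rows
last_two = numbers_df.tail(2)        # last two rows, original index labels kept

print(first_three)
print(last_two)
```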

#### Question:

<li>Use the head() method to select the first 6 rows.</li>
<li>Use the tail() method to select the last 8 rows.</li>

In [321]:
df.head(6)

Unnamed: 0,Date,Temperature,Windspeed,Event
0,1/1/2017,32,6,Rain
1,1/4/2017,not available,9,Sunny
2,1/5/2017,-1,not measured,Snow
3,1/6/2017,not available,7,no event
4,1/7/2017,32,not measured,Rain
5,1/8/2017,not available,not measured,Sunny


In [322]:
df.tail(8)

Unnamed: 0,Date,Temperature,Windspeed,Event
5,1/8/2017,not available,not measured,Sunny
6,1/9/2017,not available,not measured,no event
7,1/10/2017,34,8,Cloudy
8,1/11/2017,-4,-1,Snow
9,1/12/2017,26,12,Sunny
10,1/13/2017,12,12,Rainy
11,1/11/2017,-1,12,Snow
12,1/14/2017,40,-1,Sunny


#### Finding the column names from the dataframe
<li>We can use the df.columns attribute to check the column names of a pandas dataframe.</li>
<li>Similarly, we can use the df.values attribute to check the data stored in a pandas dataframe.</li>

In [323]:
# this cell is about finding the column names
# check whether "not available" is present
# .columns
# type()
# .values
# size
# check whether "Sunny" is present, based on the values
# not measured, no event, not available

In [324]:
df = pd.read_csv("data/weather_data.csv", skiprows=2)
df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/4/2017,not available,9,Sunny
2,1/5/2017,-1,not measured,Snow
3,1/6/2017,not available,7,no event
4,1/7/2017,32,not measured,Rain
5,1/8/2017,not available,not measured,Sunny
6,1/9/2017,not available,not measured,no event
7,1/10/2017,34,8,Cloudy
8,1/11/2017,-4,-1,Snow
9,1/12/2017,26,12,Sunny


In [325]:
type(df)

pandas.core.frame.DataFrame

In [326]:
df.columns

Index(['day', 'temperature', 'windspeed', 'event'], dtype='object')

In [327]:
df.values

array([['1/1/2017', '32', '6', 'Rain'],
       ['1/4/2017', 'not available', '9', 'Sunny'],
       ['1/5/2017', '-1', 'not measured', 'Snow'],
       ['1/6/2017', 'not available', '7', 'no event'],
       ['1/7/2017', '32', 'not measured', 'Rain'],
       ['1/8/2017', 'not available', 'not measured', 'Sunny'],
       ['1/9/2017', 'not available', 'not measured', 'no event'],
       ['1/10/2017', '34', '8', 'Cloudy'],
       ['1/11/2017', '-4', '-1', 'Snow'],
       ['1/12/2017', '26', '12', 'Sunny'],
       ['1/13/2017', '12', '12', 'Rainy'],
       ['1/11/2017', '-1', '12', 'Snow'],
       ['1/14/2017', '40', '-1', 'Sunny']], dtype=object)

In [328]:
df.shape

(13, 4)

In [329]:
df.size # no_of_rows * no_of_cols

52

In [330]:
df.values == "not measured"

array([[False, False, False, False],
       [False, False, False, False],
       [False, False,  True, False],
       [False, False, False, False],
       [False, False,  True, False],
       [False, False,  True, False],
       [False, False,  True, False],
       [False, False, False, False],
       [False, False, False, False],
       [False, False, False, False],
       [False, False, False, False],
       [False, False, False, False],
       [False, False, False, False]])

In [331]:
df["temperature"] == "not measured"

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
Name: temperature, dtype: bool

In [332]:
df[(df["event"] == "no event")]

Unnamed: 0,day,temperature,windspeed,event
3,1/6/2017,not available,7,no event
6,1/9/2017,not available,not measured,no event


In [333]:
df[df["windspeed"] == "not measured"]

Unnamed: 0,day,temperature,windspeed,event
2,1/5/2017,-1,not measured,Snow
4,1/7/2017,32,not measured,Rain
5,1/8/2017,not available,not measured,Sunny
6,1/9/2017,not available,not measured,no event


In [334]:
df[(df["temperature"] == "not available") & (df["windspeed"] == "not measured") & (df["event"] == "no event")]

Unnamed: 0,day,temperature,windspeed,event
6,1/9/2017,not available,not measured,no event


#### Checking the type of your dataframe 
<li>Another feature that makes pandas better for working with data is that dataframes can contain more than one data type.</li>
<li>Axis values can have string labels, not just numeric ones.</li>
<li>Dataframes can contain columns with multiple data types: including integer, float, and string.</li>
<li>We can use the DataFrame.dtypes attribute (similar to NumPy) to return information about the types of each column.</li>
<li>When we import data, pandas attempts to guess the correct dtype for each column.</li>
<li>Generally, pandas does well with this, which means we don't need to worry about specifying dtypes every time we start to work with data.</li>
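
The guess depends on the values pandas sees: a single non-numeric string is enough for a column to be read as text. A small sketch, assuming a made-up in-memory CSV in which `"not available"` stands in for a missing temperature:

```python
import io

import pandas as pd

# Made-up CSV text for illustration.
csv_text = "temperature,windspeed\n32,6\nnot available,9\n-1,7\n"

# Without na_values, the placeholder text forces 'temperature' to be read as strings.
as_read = pd.read_csv(io.StringIO(csv_text))
# Declaring the placeholder as missing lets pandas infer a numeric dtype.
with_na = pd.read_csv(io.StringIO(csv_text), na_values=["not available"])

print(as_read.dtypes)
print(with_na.dtypes)
```

The `windspeed` column contains only whole numbers, so it is inferred as an integer column in both cases.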



In [335]:
df.dtypes

day            object
temperature    object
windspeed      object
event          object
dtype: object

In [336]:
print(weather_df.dtypes)

day             object
temperature    float64
windspeed      float64
event           object
dtype: object


#### Datatypes Information
<li>We can get the shape of the dataset using the <b>.shape</b> attribute.</li>
<li><b>.shape</b> returns a tuple containing the number of rows and the number of columns in the dataset.</li>
<li>If we wanted an overview of all the dtypes used in our dataframe, we can use <b>.info()</b> method.</li>
<li>Note that <b>DataFrame.info()</b> prints the information, rather than returning it, so we can't assign it to a variable.</li>
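
A minimal sketch of both points, using a made-up three-row dataframe: the shape is read as an attribute and can be unpacked, while the value returned by `info()` is `None`.

```python
import pandas as pd

# Made-up data for illustration.
small_df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

n_rows, n_cols = small_df.shape   # attribute access, no parentheses
result = small_df.info()          # prints a summary and returns None

print(n_rows, n_cols)
print(result)
```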


In [337]:
df.shape

(13, 4)

In [338]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13 entries, 0 to 12
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   day          13 non-null     object
 1   temperature  13 non-null     object
 2   windspeed    13 non-null     object
 3   event        13 non-null     object
dtypes: object(4)
memory usage: 548.0+ bytes


In [339]:
weather_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13 entries, 0 to 12
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   day          13 non-null     object 
 1   temperature  9 non-null      float64
 2   windspeed    9 non-null      float64
 3   event        11 non-null     object 
dtypes: float64(2), object(2)
memory usage: 548.0+ bytes


#### Checking the null values in the pandas dataframe

In [340]:
weather_df.isna()

Unnamed: 0,day,temperature,windspeed,event
0,False,False,False,False
1,False,True,False,False
2,False,False,True,False
3,False,True,False,True
4,False,False,True,False
5,False,True,True,False
6,False,True,True,True
7,False,False,False,False
8,False,False,False,False
9,False,False,False,False


In [341]:
weather_df.isna().sum()

day            0
temperature    4
windspeed      4
event          2
dtype: int64

#### set_index() and reset_index() method

In [342]:
weather_df.set_index(keys="temperature", inplace=True)


In [343]:
weather_df

Unnamed: 0_level_0,day,windspeed,event
temperature,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
32.0,1/1/2017,6.0,Rain
,1/4/2017,9.0,Sunny
-1.0,1/5/2017,,Snow
,1/6/2017,7.0,
32.0,1/7/2017,,Rain
,1/8/2017,,Sunny
,1/9/2017,,
34.0,1/10/2017,8.0,Cloudy
-4.0,1/11/2017,-1.0,Snow
26.0,1/12/2017,12.0,Sunny


In [344]:
weather_df.reset_index(inplace=True)
weather_df

Unnamed: 0,temperature,day,windspeed,event
0,32.0,1/1/2017,6.0,Rain
1,,1/4/2017,9.0,Sunny
2,-1.0,1/5/2017,,Snow
3,,1/6/2017,7.0,
4,32.0,1/7/2017,,Rain
5,,1/8/2017,,Sunny
6,,1/9/2017,,
7,34.0,1/10/2017,8.0,Cloudy
8,-4.0,1/11/2017,-1.0,Snow
9,26.0,1/12/2017,12.0,Sunny
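
`set_index()` moves a column into the row index, and `reset_index()` moves the index back into a regular column and restores the default 0, 1, 2, ... index. The cells above modify `weather_df` in place; the sketch below, on a small made-up dataframe, relies on the default behaviour instead, where each call returns a new dataframe and the original is left untouched.

```python
import pandas as pd

# Made-up data for illustration.
cities_df = pd.DataFrame({"city": ["Kathmandu", "Pokhara"], "temperature": [21.5, 24.0]})

indexed = cities_df.set_index("city")   # 'city' becomes the row index
restored = indexed.reset_index()        # 'city' is a regular column again

print(indexed.loc["Pokhara", "temperature"])
print(list(restored.columns))
print(list(cities_df.columns))          # the original dataframe is unchanged
```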


#### Selecting a column from a pandas DataFrame

<li>Since the axes in pandas have labels, we can select data using those labels.</li>
<li>Unlike in NumPy, we do not need to know the exact integer position of the data in a pandas dataframe.</li>
<li>To do this, we can use the DataFrame.loc[] indexer. The syntax for DataFrame.loc[] is:</li>
<code>
df.loc[row_label, column_label]
</code>

<li>We can use the following shortcut to select a single column:</li>
<code>
df["column_name"]
</code>

<li>This style of selecting columns is very common.</li>
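
As a small illustration of label-based selection, the sketch below uses a made-up dataframe whose rows are labelled with names rather than 0, 1, 2:

```python
import pandas as pd

# Made-up scores with string row labels.
scores_df = pd.DataFrame(
    {"math": [90, 75], "science": [85, 95]},
    index=["hari", "sita"],
)

one_value = scores_df.loc["sita", "math"]   # row label, column label -> single value
one_row = scores_df.loc["hari"]             # whole row as a Series
one_column = scores_df.loc[:, "science"]    # whole column as a Series
same_column = scores_df["science"]          # shortcut for the line above

print(one_value)
print(one_row)
print(one_column.equals(same_column))
```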


In [345]:
weather_df.head(3)

Unnamed: 0,temperature,day,windspeed,event
0,32.0,1/1/2017,6.0,Rain
1,,1/4/2017,9.0,Sunny
2,-1.0,1/5/2017,,Snow


In [346]:
weather_df["temperature"]

0     32.0
1      NaN
2     -1.0
3      NaN
4     32.0
5      NaN
6      NaN
7     34.0
8     -4.0
9     26.0
10    12.0
11    -1.0
12    40.0
Name: temperature, dtype: float64

In [347]:
weather_df["event"]

0       Rain
1      Sunny
2       Snow
3        NaN
4       Rain
5      Sunny
6        NaN
7     Cloudy
8       Snow
9      Sunny
10     Rainy
11      Snow
12     Sunny
Name: event, dtype: object

#### Questions

<li>Read <b>'appointment_schedule.csv'</b> file using pandas.</li>
<li>Select the <b>'name'</b> column from the given dataset and store to <b>'appointment_names'</b> variable.</li>
<li>Use Python's <b>type()</b> function to assign the type of name column to <b>name_type</b>.</li>

In [348]:
schedule_df = pd.read_csv("data/appointment_schedule.csv")
schedule_df.head(2)

Unnamed: 0,name,appointment_made_date,app_start_date,app_end_date,visitee_namelast,visitee_namefirst,meeting_room,description
0,Joshua T. Blanton,2014-12-18T00:00:00,1/6/15 9:30,1/6/15 23:59,,potus,west wing,JointService Military Honor Guard
1,Jack T. Gutting,2014-12-18T00:00:00,1/6/15 9:30,1/6/15 23:59,,potus,west wing,JointService Military Honor Guard


In [349]:
appointment_names = schedule_df["name"]
appointment_names

0        Joshua T. Blanton
1          Jack T. Gutting
2        Bradley T. Guiles
3           Loryn F. Grieb
4         Travis D. Gordon
              ...         
580         Ryan J. Morgan
581    Alexander V. Nevsky
582     Montana J. Johnson
583    Joseph A. Pritchard
584        Martin O. Reina
Name: name, Length: 585, dtype: object

In [350]:
# type, shape
type(appointment_names)

pandas.core.series.Series

In [351]:
appointment_names.shape

(585,)

#### Pandas Series
<li>Series is the pandas type for one-dimensional objects.</li>
<li>Anytime you see a 1D pandas object, it will be a series. Anytime you see a 2D pandas object, it will be a dataframe.</li>
<li>A dataframe is a collection of series objects, which is similar to how pandas stores the data behind the scenes.</li>
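
A brief sketch of the 1D versus 2D distinction, on made-up data: selecting one column by name gives a series, selecting with a list of names gives a dataframe, and a dataframe can be put back together from series that share an index.

```python
import pandas as pd

# Made-up data for illustration.
people_df = pd.DataFrame({"name": ["Hari", "Sita"], "age": [34, 22]})

ages = people_df["age"]        # one column name -> Series (1D)
subset = people_df[["age"]]    # list of column names -> DataFrame (2D)

print(type(ages).__name__, ages.shape)
print(type(subset).__name__, subset.shape)

# Assemble a dataframe again from two series with the same index.
rebuilt = pd.DataFrame({"name": people_df["name"], "age": ages})
print(rebuilt.equals(people_df))
```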

#### Adding a column in a pandas dataframe

In [352]:
weather_df = pd.read_csv("r_w_data/weather_nan_to.csv")
weather_df = weather_df.drop(columns="Unnamed: 0")
weather_df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32.0,6.0,Rain
1,1/4/2017,,9.0,Sunny
2,1/5/2017,-1.0,,Snow
3,1/6/2017,,7.0,
4,1/7/2017,32.0,,Rain
5,1/8/2017,,,Sunny
6,1/9/2017,,,
7,1/10/2017,34.0,8.0,Cloudy
8,1/11/2017,-4.0,-1.0,Snow
9,1/12/2017,26.0,12.0,Sunny


In [353]:
weather_df["Holiday"] = [True, True, False, False, False, True, True, True, False, False, False, False, False]
weather_df

Unnamed: 0,day,temperature,windspeed,event,Holiday
0,1/1/2017,32.0,6.0,Rain,True
1,1/4/2017,,9.0,Sunny,True
2,1/5/2017,-1.0,,Snow,False
3,1/6/2017,,7.0,,False
4,1/7/2017,32.0,,Rain,False
5,1/8/2017,,,Sunny,True
6,1/9/2017,,,,True
7,1/10/2017,34.0,8.0,Cloudy,True
8,1/11/2017,-4.0,-1.0,Snow,False
9,1/12/2017,26.0,12.0,Sunny,False


In [354]:
time = pd.Series([3, 5, 2, 5, 2, 5, 2, 5, 2, 5])

In [355]:
weather_df.insert(loc=1, column="Time", value=time)

In [356]:
weather_df

Unnamed: 0,day,Time,temperature,windspeed,event,Holiday
0,1/1/2017,3.0,32.0,6.0,Rain,True
1,1/4/2017,5.0,,9.0,Sunny,True
2,1/5/2017,2.0,-1.0,,Snow,False
3,1/6/2017,5.0,,7.0,,False
4,1/7/2017,2.0,32.0,,Rain,False
5,1/8/2017,5.0,,,Sunny,True
6,1/9/2017,2.0,,,,True
7,1/10/2017,5.0,34.0,8.0,Cloudy,True
8,1/11/2017,2.0,-4.0,-1.0,Snow,False
9,1/12/2017,5.0,26.0,12.0,Sunny,False


In [357]:
weather_df.loc[:, "Demo"] = list("lkdflfjddddkd")
weather_df.head(2)

Unnamed: 0,day,Time,temperature,windspeed,event,Holiday,Demo
0,1/1/2017,3.0,32.0,6.0,Rain,True,l
1,1/4/2017,5.0,,9.0,Sunny,True,k


### Selecting Multiple Columns From the DataFrame

![](images/selecting_columns.png)

<li>We can select multiple columns from the dataframe by using the following code:</li>
<code>
    df.loc[:, ["col1", "col2"]]
</code>

<li>We can use syntax shortcuts for selecting multiple columns by using the following syntax:</li>
<code>
    df[["col1", "col2"]]
</code>

In [358]:
car_details_df = pd.read_csv('data/car_details.csv')
car_details_df.head()

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner
0,Maruti 800 AC,2007,60000,70000,Petrol,Individual,Manual,First Owner
1,Maruti Wagon R LXI Minor,2007,135000,50000,Petrol,Individual,Manual,First Owner
2,Hyundai Verna 1.6 SX,2012,600000,100000,Diesel,Individual,Manual,First Owner
3,Datsun RediGO T Option,2017,250000,46000,Petrol,Individual,Manual,First Owner
4,Honda Amaze VX i-DTEC,2014,450000,141000,Diesel,Individual,Manual,Second Owner


In [359]:
car_details_df.loc[:, ['name', 'selling_price', 'km_driven']].head()

Unnamed: 0,name,selling_price,km_driven
0,Maruti 800 AC,60000,70000
1,Maruti Wagon R LXI Minor,135000,50000
2,Hyundai Verna 1.6 SX,600000,100000
3,Datsun RediGO T Option,250000,46000
4,Honda Amaze VX i-DTEC,450000,141000


In [360]:
car_details_df[['name', 'selling_price', 'km_driven']].head()

Unnamed: 0,name,selling_price,km_driven
0,Maruti 800 AC,60000,70000
1,Maruti Wagon R LXI Minor,135000,50000
2,Hyundai Verna 1.6 SX,600000,100000
3,Datsun RediGO T Option,250000,46000
4,Honda Amaze VX i-DTEC,450000,141000


In [361]:
car_details_limited = car_details_df.drop(['year', 'fuel', 'seller_type',
                                          'transmission', 'owner'],
                                          axis = 1)
car_details_limited.head()

Unnamed: 0,name,selling_price,km_driven
0,Maruti 800 AC,60000,70000
1,Maruti Wagon R LXI Minor,135000,50000
2,Hyundai Verna 1.6 SX,600000,100000
3,Datsun RediGO T Option,250000,46000
4,Honda Amaze VX i-DTEC,450000,141000


In [362]:
car_details_limited["km_driven"].dtypes

dtype('int64')

#### Selecting Rows From A Pandas DataFrame

<li>Now that we've learned how to select columns by label, let's learn how to select rows using the labels of the index axis.</li>
<li>We can use the same syntax to select rows from a dataframe as we do for columns:</li>
<code>
    df.loc[row_label, column_label]
</code>

![](images/selecting_one_row.png)

In [363]:
weather_df.loc[2]

day            1/5/2017
Time                2.0
temperature        -1.0
windspeed           NaN
event              Snow
Holiday           False
Demo                  d
Name: 2, dtype: object

### Selecting Multiple Rows From the DataFrame

![](images/selecting_multiple_rows.png)

In [364]:
weather_df.loc[:4]

Unnamed: 0,day,Time,temperature,windspeed,event,Holiday,Demo
0,1/1/2017,3.0,32.0,6.0,Rain,True,l
1,1/4/2017,5.0,,9.0,Sunny,True,k
2,1/5/2017,2.0,-1.0,,Snow,False,d
3,1/6/2017,5.0,,7.0,,False,f
4,1/7/2017,2.0,32.0,,Rain,False,l


In [365]:
weather_df.loc[weather_df["Holiday"]==True]

Unnamed: 0,day,Time,temperature,windspeed,event,Holiday,Demo
0,1/1/2017,3.0,32.0,6.0,Rain,True,l
1,1/4/2017,5.0,,9.0,Sunny,True,k
5,1/8/2017,5.0,,,Sunny,True,f
6,1/9/2017,2.0,,,,True,j
7,1/10/2017,5.0,34.0,8.0,Cloudy,True,d


#### Indexing & Slicing In Pandas DataFrame

<li>We can slice a dataset along its rows as well as its columns.</li>
<li>If we have data of shape (5, 5) and we want the first three rows and first three columns, we need to slice both rows and columns to get the desired shape.</li>
<li>We have the df.iloc[] indexer, which we can use for position-based indexing as well as slicing in a dataframe.</li>
<li>Let's practice using .iloc[].</li>


In [366]:
weather_df.iloc[:3, :3]

Unnamed: 0,day,Time,temperature
0,1/1/2017,3.0,32.0
1,1/4/2017,5.0,
2,1/5/2017,2.0,-1.0


In [367]:
weather_df.loc[:3, ["temperature", "Time"]]

Unnamed: 0,temperature,Time
0,32.0,3.0
1,,5.0
2,-1.0,2.0
3,,5.0


In [368]:
weather_df.iloc[:3, :6]

Unnamed: 0,day,Time,temperature,windspeed,event,Holiday
0,1/1/2017,3.0,32.0,6.0,Rain,True
1,1/4/2017,5.0,,9.0,Sunny,True
2,1/5/2017,2.0,-1.0,,Snow,False


In [369]:
weather_df.iloc[:2, :99]

Unnamed: 0,day,Time,temperature,windspeed,event,Holiday,Demo
0,1/1/2017,3.0,32.0,6.0,Rain,True,l
1,1/4/2017,5.0,,9.0,Sunny,True,k
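
Note the difference in the outputs above: `loc[:3]` returned four rows while `iloc[:3]` returned three. `loc` slices by label and includes the end label, whereas `iloc` slices by integer position and, like ordinary Python slicing, excludes the end position and clips a slice end that is out of range. A compact sketch on a made-up dataframe with the default 0..4 index:

```python
import pandas as pd

# Made-up five-row dataframe.
letters_df = pd.DataFrame(
    {"letter": list("abcde"), "number": [10, 20, 30, 40, 50]}
)

by_label = letters_df.loc[:3]        # labels 0, 1, 2 and 3 (end label included)
by_position = letters_df.iloc[:3]    # positions 0, 1 and 2 (end position excluded)
clipped = letters_df.iloc[:2, :99]   # out-of-range slice end is clipped, no error

print(len(by_label), len(by_position), clipped.shape)
```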


#### Datatype Conversion In Pandas
<li>Pandas astype() is one of the most important methods. It is used to change the data type of a series or of dataframe columns.</li>
<li>When a pandas dataframe is created from a CSV file, the data type of each column is set automatically.</li>
<li>At times the inferred datatype will not be what it should be, and this is where we can use astype() to get the desired datatype.</li>
<li>For example, a salary column could be imported as strings, but to do arithmetic on it we have to convert it to float.</li>
<li>astype() is used to do such data type conversions.</li>
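
The salary example can be sketched as follows. The dataframe and its values are made up for illustration; the `salary` column starts out as strings and is converted to `float64`.

```python
import pandas as pd

# Made-up data: salaries stored as text.
salary_df = pd.DataFrame({"name": ["Hari", "Sita"], "salary": ["45000.50", "52000"]})

# Convert the text column to floats so that numeric operations work as expected.
salary_df["salary"] = salary_df["salary"].astype("float64")

print(salary_df.dtypes)
print(salary_df["salary"].sum())
```

If a value cannot be parsed as a number, `astype()` raises a `ValueError`, so placeholder text such as "not available" has to be handled before the conversion.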

In [370]:
weather_df.drop(columns="Time", inplace=True)

In [371]:
time = [4, 2, 5, 2, 5, 2, 5, 2, 6, 2, 5, 2, 5]
weather_df.insert(loc=1, column="Time", value=time)

In [372]:
weather_df.head(3)

Unnamed: 0,day,Time,temperature,windspeed,event,Holiday,Demo
0,1/1/2017,4,32.0,6.0,Rain,True,l
1,1/4/2017,2,,9.0,Sunny,True,k
2,1/5/2017,5,-1.0,,Snow,False,d


In [373]:
weather_df.astype({"Time": "float64"}).dtypes

day             object
Time           float64
temperature    float64
windspeed      float64
event           object
Holiday           bool
Demo            object
dtype: object

#### Value Counts Method

<li>Since series and dataframes are two distinct objects, they have their own unique methods.</li>

<li>Let's look at an example of a series method - the Series.value_counts() method.</li>

<li>This method displays each unique non-null value in a column along with its count, sorted from most to least frequent.</li>

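<li>In pandas versions before 1.1, value_counts() was a series-only method, and calling it on a dataframe raised the following error:</li>

<code>
    AttributeError: 'DataFrame' object has no attribute 'value_counts'
</code>

<li>Since pandas 1.1, DataFrame.value_counts() is also available; it counts unique rows instead of unique values.</li>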

In [374]:
weather_df["Holiday"].value_counts()

Holiday
False    8
True     5
Name: count, dtype: int64

In [375]:
weather_df["event"].value_counts()

event
Sunny     4
Snow      3
Rain      2
Cloudy    1
Rainy     1
Name: count, dtype: int64

In [376]:
weather_df.value_counts()

day        Time  temperature  windspeed  event   Holiday  Demo
1/1/2017   4      32.0         6.0       Rain    True     l       1
1/10/2017  2      34.0         8.0       Cloudy  True     d       1
1/11/2017  2     -1.0          12.0      Snow    False    k       1
           6     -4.0         -1.0       Snow    False    d       1
1/12/2017  2      26.0         12.0      Sunny   False    d       1
1/13/2017  5      12.0         12.0      Rainy   False    d       1
1/14/2017  5      40.0        -1.0       Sunny   False    d       1
Name: count, dtype: int64
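
The dataframe version counts unique rows and, by default, skips rows that contain a missing value, which is likely why the output above lists fewer rows than the dataframe has. The series version skips `NaN` by default as well, and `dropna=False` includes it. A sketch with made-up data:

```python
import pandas as pd

# Made-up data with one missing event.
events_df = pd.DataFrame(
    {"event": ["Rain", "Sunny", None, "Rain"], "holiday": [True, False, False, True]}
)

event_counts = events_df["event"].value_counts()                  # missing value excluded
event_counts_all = events_df["event"].value_counts(dropna=False)  # missing value counted too
row_counts = events_df.value_counts()                             # unique rows without missing values

print(event_counts)
print(event_counts_all)
print(row_counts)
```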

#### Creating a frequency table from value_counts 

In [377]:
freq_table = weather_df.event.value_counts()
pd.DataFrame(freq_table.values, freq_table.index, columns=["frequency"])

Unnamed: 0_level_0,frequency
event,Unnamed: 1_level_1
Sunny,4
Snow,3
Rain,2
Cloudy,1
Rainy,1
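
An alternative sketch for the same frequency table uses `Series.to_frame()`, which turns the counts into a one-column dataframe directly. The `events` series below is made up for illustration.

```python
import pandas as pd

# Made-up event labels.
events = pd.Series(["Sunny", "Snow", "Sunny", "Rain"], name="event")

freq_table = events.value_counts()
freq_df = freq_table.to_frame(name="frequency")   # Series -> one-column DataFrame

print(freq_df)
```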


#### Renaming the column names in a pandas dataframe

In [378]:
weather_df.head(2)

Unnamed: 0,day,Time,temperature,windspeed,event,Holiday,Demo
0,1/1/2017,4,32.0,6.0,Rain,True,l
1,1/4/2017,2,,9.0,Sunny,True,k


In [379]:
weather_df.rename(columns={"day": "Day", "event": "Event", "windspeed": "Windspeed", "temperature": "Temperature"}, inplace=True)
weather_df.head(1)

Unnamed: 0,Day,Time,Temperature,Windspeed,Event,Holiday,Demo
0,1/1/2017,4,32.0,6.0,Rain,True,l


#### Selecting Items From A Series

<li>As with dataframes, we can use Series.loc[] to select items from a series using single labels, a list, or a slice object.</li>
<li>We can also omit loc[] and use bracket shortcuts for all three:</li>

![](images/selecting_series.png)

In [380]:
series = pd.Series(["Ram", "Shyam", "Hari", "Gopal"])
series

0      Ram
1    Shyam
2     Hari
3    Gopal
dtype: object

In [381]:
series.loc[3]

'Gopal'

In [382]:
series.loc[:3]

0      Ram
1    Shyam
2     Hari
3    Gopal
dtype: object

In [383]:
series.loc[2:]

2     Hari
3    Gopal
dtype: object
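
The same three kinds of selection also work when the index holds string labels. The sketch below uses a made-up series of ages indexed by name and shows each `loc[]` form next to its bracket shortcut.

```python
import pandas as pd

# Made-up ages with string labels.
ages = pd.Series([34, 22, 44, 51], index=["hari", "sita", "gita", "ram"])

single = ages.loc["sita"]             # one label -> single value
several = ages.loc[["hari", "ram"]]   # list of labels -> Series
span = ages.loc["sita":"ram"]         # label slice, end label included

# Bracket shortcuts for the same three selections
same_single = ages["sita"]
same_several = ages[["hari", "ram"]]
same_span = ages["sita":"ram"]

print(single)
print(several)
print(span)
```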

#### Question

<li>Use the value counts method to check the frequency count of different names from 'appointment_schedule.csv' file.</li>
<li>Select only first row from the series.</li>
<li>Select the first row and the last row from the series.</li>
<li>Select the first five rows and the last five rows from the series.</li>



In [384]:
schedule_df.head(2)

Unnamed: 0,name,appointment_made_date,app_start_date,app_end_date,visitee_namelast,visitee_namefirst,meeting_room,description
0,Joshua T. Blanton,2014-12-18T00:00:00,1/6/15 9:30,1/6/15 23:59,,potus,west wing,JointService Military Honor Guard
1,Jack T. Gutting,2014-12-18T00:00:00,1/6/15 9:30,1/6/15 23:59,,potus,west wing,JointService Military Honor Guard


In [385]:
schedule_df.description.value_counts()

description
JointService Military Honor Guard                      95
military honor guard                                   78
**DELEGATION FOR WORKING LUNCH**                       12
*Dinner Participants*                                  10
marine band                                             6
honor guard                                             5
Travel w POTUS  WW Lobby to West Exec.                  4
State Dept technicians for interpretation equipment     2
US Ambassador to Mexico                                 1
Name: count, dtype: int64

In [386]:
schedule_df.visitee_namefirst.value_counts()

visitee_namefirst
POTUS          376
potus          179
President       26
POTUS/CLARE      2
Charles          1
POTUS/max        1
Name: count, dtype: int64

In [387]:
schedule_df.meeting_room.value_counts()

meeting_room
State Floo    279
west wing     178
RESIDENCE      48
Cabinet Ro     38
The Roosev     13
OVAL OFFIC     12
state floo      6
WW Lobby        4
WEST WING       2
ww/Oval Of      2
Oval Offic      1
ew 206          1
Roosevelt       1
Name: count, dtype: int64

In [388]:
# selecting first row of the DF
schedule_df.iloc[0]

name                                     Joshua T. Blanton
appointment_made_date                  2014-12-18T00:00:00
app_start_date                                 1/6/15 9:30
app_end_date                                  1/6/15 23:59
visitee_namelast                                       NaN
visitee_namefirst                                    potus
meeting_room                                     west wing
description              JointService Military Honor Guard
Name: 0, dtype: object

In [389]:
# last one row
schedule_df.iloc[-1]

name                          Martin O. Reina
appointment_made_date     2015-01-09T00:00:00
app_start_date                  1/16/15 10:00
app_end_date                    1/16/15 23:59
visitee_namelast                          NaN
visitee_namefirst                       potus
meeting_room                        west wing
description              military honor guard
Name: 584, dtype: object

In [390]:
# first 5 and last 5 rows
first_5 = schedule_df.head(5)
last_5 = schedule_df.tail(5)

#### DataFrame Vs Series

![](images/dataframe_vs_series.png)

#### Summary

![](images/pandas_selection_summary.png)

#### Vectorized Operations In Pandas

<li>We'll explore how pandas uses many of the concepts we learned in NumPy.</li>
<li>Because pandas is designed to operate like NumPy, a lot of concepts and methods from NumPy are supported.</li>
<li>Recall that one of the ways NumPy makes working with data easier is with vectorized operations.</li>
<li>Just like with NumPy, we can use any of the standard Python numeric operators with series, including:</li>
<code>
    series_a + series_b - Addition
    series_a - series_b - Subtraction
    series_a * series_b - Multiplication
    series_a / series_b - Division
</code>

In [391]:
a = pd.Series([4, 3, 6, 2])
b = pd.Series([5, 2, 5, 2])
a + b - 6

0    3
1   -1
2    5
3   -2
dtype: int64

In [392]:
a * b - 88

0   -68
1   -82
2   -58
3   -84
dtype: int64

In [393]:
weather_df["Temperature"] + weather_df["Windspeed"]

0     38.0
1      NaN
2      NaN
3      NaN
4      NaN
5      NaN
6      NaN
7     42.0
8     -5.0
9     38.0
10    24.0
11    11.0
12    39.0
dtype: float64

In [394]:
weather_df["Time"] = weather_df["Time"] + 3

In [395]:
weather_df["Time"]

0     7
1     5
2     8
3     5
4     8
5     5
6     8
7     5
8     9
9     5
10    8
11    5
12    8
Name: Time, dtype: int64

In [396]:
weather_df.head(3)

Unnamed: 0,Day,Time,Temperature,Windspeed,Event,Holiday,Demo
0,1/1/2017,7,32.0,6.0,Rain,True,l
1,1/4/2017,5,,9.0,Sunny,True,k
2,1/5/2017,8,-1.0,,Snow,False,d


#### Some Statistical Functions In Pandas

<li>Like NumPy, Pandas supports many descriptive stats methods such as mean, median, mode, min, max and so on.</li>
<li>Here are a few of the most useful ones.</li>
<code>
Series.max()
Series.min()
Series.mean()
Series.median()
Series.mode()
Series.sum()
</code>
<li>We can calculate the average value of a particular column(series) using df.column_name.mean().</li>
<li>For calculating the minimum value in a particular column(series), we can use df.column_name.min().</li>
<li>Similarly, for calculating the maximum value in a particular column(series), we can use df.column_name.max().</li>

In [397]:
# max value
weather_df.Windspeed.max()

12.0

In [398]:
# min value
weather_df.Temperature.min()

-4.0

In [399]:
# mean
weather_df.Temperature.mean()

18.88888888888889

In [400]:
# sum
weather_df.Windspeed.sum()

64.0

In [401]:
# median
weather_df.Temperature.median()

26.0

In [402]:
# mode 
schedule_df.description.mode()

0    JointService Military Honor Guard
Name: description, dtype: object

#### Finding the descriptive statistics of the dataframe using .describe() method

<li>Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset's distribution, excluding NaN values.</li>
<li>By default, the describe() method in Pandas computes descriptive statistics for all numeric columns of a DataFrame.</li>
<li>It can analyze both numeric and object series, as well as DataFrame column sets of mixed data types.</li>
<li>The output will vary depending on the data types that are provided.</li>
<li>If we want to see the descriptive statistics of an object datatype then we have to specify <b>df.describe(include = "O")</b></li>

In [403]:
weather_df.describe()

Unnamed: 0,Time,Temperature,Windspeed
count,13.0,9.0,9.0
mean,6.615385,18.888889,7.111111
std,1.609268,17.431612,5.109903
min,5.0,-4.0,-1.0
25%,5.0,-1.0,6.0
50%,7.0,26.0,8.0
75%,8.0,32.0,12.0
max,9.0,40.0,12.0


In [404]:
weather_df.describe(exclude=["int64", "float64", "bool"])

Unnamed: 0,Day,Event,Demo
count,13,11,13
unique,12,5,5
top,1/11/2017,Sunny,d
freq,2,4,6


In [405]:
schedule_df.describe(include="object")

Unnamed: 0,name,appointment_made_date,app_start_date,app_end_date,visitee_namelast,visitee_namefirst,meeting_room,description
count,585,585,585,585,56,585,585,213
unique,542,11,23,9,5,6,13,9
top,Jesus MurilloKaram,2015-01-09T00:00:00,1/12/15 13:00,1/12/15 23:59,/,POTUS,State Floo,JointService Military Honor Guard
freq,3,247,217,286,36,376,279,95


#### Assigning Values With Pandas

<li>Just like in NumPy, the same techniques that we use to select data could be used for assignment.</li>

<li>When we selected a whole column by label and used assignment, we assigned the value to every item in that column.</li>

<li>By providing labels for both axes, we can assign a value to a single cell within our dataframe.</li>

<code>
    df.loc[row_label, col_label] = assignment_value
</code>

In [406]:
import pandas as pd

In [407]:
weather_df.loc[1, "Temperature"] = 22
weather_df.head(2)

Unnamed: 0,Day,Time,Temperature,Windspeed,Event,Holiday,Demo
0,1/1/2017,7,32.0,6.0,Rain,True,l
1,1/4/2017,5,22.0,9.0,Sunny,True,k


In [408]:
# astype() returns a new Series, so assign it back to change the column's dtype
weather_df["Time"] = weather_df["Time"].astype("float64")
weather_df.loc[0, "Time"] = 3.15
weather_df.head(2)


Unnamed: 0,Day,Time,Temperature,Windspeed,Event,Holiday,Demo
0,1/1/2017,3.15,32.0,6.0,Rain,True,l
1,1/4/2017,5.0,22.0,9.0,Sunny,True,k


#### Using Boolean Indexing With Pandas Objects (Selection With Condition In Pandas)
<li>We can assign a value by using a row label and a column label in pandas.</li>
<li>But what if we need to assign the same value to every row that meets a certain criterion?</li>
<li>In that case, we can use boolean indexing to change all rows that meet the criterion, just like we did with NumPy.</li>


<ol>
    <li>Equals: df['series'] == value</li>
    <li>Not Equals: df['series'] != value</li>
    <li>Less than: df['series'] < value</li>
    <li>Less than or equal to: df['series'] <= value</li>
    <li>Greater than: df['series'] > value</li>
    <li>Greater than or equal to: df['series'] >= value</li>
</ol>
<li>These conditions can be used in several ways, most commonly inside .loc to select values with conditions.</li>
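<li>A minimal sketch of assignment with a boolean mask, as described above. The dataframe toy_df and its values are hypothetical and are not taken from the weather file used in this notebook.</li>

```python
import pandas as pd

# Hypothetical toy data (an assumption for illustration only)
toy_df = pd.DataFrame({
    "Event": ["Rain", "Sunny", "Snow", "Sunny"],
    "Temperature": [32.0, 22.0, -1.0, 40.0],
})

# Boolean mask: True for every row whose Event equals "Sunny"
sunny_mask = toy_df["Event"] == "Sunny"

# The mask selects the rows, the label selects the column,
# and the same value is assigned to every selected cell
toy_df.loc[sunny_mask, "Event"] = "Clear"
```

<li>Passing the mask and the column label together to .loc changes the original dataframe in a single step, which avoids assigning to a temporary copy.</li>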

In [409]:
weather_df["Event"] == "Sunny"

0     False
1      True
2     False
3     False
4     False
5      True
6     False
7     False
8     False
9      True
10    False
11    False
12     True
Name: Event, dtype: bool

In [410]:
weather_df["Holiday"] != True

0     False
1     False
2      True
3      True
4      True
5     False
6     False
7     False
8      True
9      True
10     True
11     True
12     True
Name: Holiday, dtype: bool

In [411]:
weather_df["Temperature"] > 5

0      True
1      True
2     False
3     False
4      True
5     False
6     False
7      True
8     False
9      True
10     True
11    False
12     True
Name: Temperature, dtype: bool

In [412]:
weather_df.loc[weather_df["Windspeed"] <= 11]

Unnamed: 0,Day,Time,Temperature,Windspeed,Event,Holiday,Demo
0,1/1/2017,3.15,32.0,6.0,Rain,True,l
1,1/4/2017,5.0,22.0,9.0,Sunny,True,k
3,1/6/2017,5.0,,7.0,,False,f
7,1/10/2017,5.0,34.0,8.0,Cloudy,True,d
8,1/11/2017,9.0,-4.0,-1.0,Snow,False,d
12,1/14/2017,8.0,40.0,-1.0,Sunny,False,d


In [413]:
weather_df.loc[(weather_df["Time"] > 6) & ( weather_df["Event"] == "Sunny")]

Unnamed: 0,Day,Time,Temperature,Windspeed,Event,Holiday,Demo
12,1/14/2017,8.0,40.0,-1.0,Sunny,False,d


### Using Pandas Method To Create a Boolean Mask

<li>In the last couple lessons, we used Python boolean operators to create boolean masks to select subsets of data.</li>
    
<li>There are also a number of pandas methods that return boolean masks useful for exploring data.</li>

<li>Two examples are the Series.isnull() method and Series.notnull() method.</li>
<li>The Series.isnull() method can be used to select rows that contain null (or NaN) values for a certain column.</li>
<li>Similarly, the Series.notnull() method is used to select rows that do not contain null values for a certain column.</li>
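<li>A small sketch of both methods used as boolean masks. The dataframe toy_df and its column values are hypothetical stand-ins, not the Fortune 1000 data.</li>

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing prev_rank values
toy_df = pd.DataFrame({
    "company": ["A", "B", "C", "D"],
    "prev_rank": [1.0, np.nan, 3.0, np.nan],
})

# isnull() returns True where the value is missing
null_mask = toy_df["prev_rank"].isnull()
missing_rows = toy_df[null_mask]

# notnull() is the opposite: True where a value is present
present_rows = toy_df[toy_df["prev_rank"].notnull()]
```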

#### Question 1

<li>Read 'Fortune_1000.csv' file using pandas read_csv() method and store it in a variable named f1000.</li>
<li>Select the rank, revenue, and rank_change columns in f1000. Then, use the df.head() method to select the first five rows.</li>
<li>Select just the fifth row of the f1000 dataframe. Assign the result to fifth_row using iloc.</li>
<li>Select the value in first row of the company column. Assign the result to company_value.</li>
<li>Select the last three rows of the f1000 dataframe. Assign the result to last_three_rows.</li>
<li>Select the first five rows and the first seven columns of the f1000 dataframe.</li>



In [414]:
f1000 = pd.read_csv("data/Fortune_1000.csv")
f1000.head(2)

Unnamed: 0,company,rank,rank_change,revenue,profit,num. of employees,sector,city,state,newcomer,ceo_founder,ceo_woman,profitable,prev_rank,CEO,Website,Ticker,Market Cap
0,Walmart,1,0.0,572754.0,13673.0,2300000.0,Retailing,Bentonville,AR,no,no,no,yes,1.0,C. Douglas McMillon,https://www.stock.walmart.com,WMT,352037
1,Amazon,2,0.0,469822.0,33364.0,1608000.0,Retailing,Seattle,WA,no,no,no,yes,2.0,Andrew R. Jassy,www.amazon.com,AMZN,1202717


In [415]:
f1000[["rank", "revenue", "rank_change"]].head(5)

Unnamed: 0,rank,revenue,rank_change
0,1,572754.0,0.0
1,2,469822.0,0.0
2,3,365817.0,0.0
3,4,292111.0,0.0
4,5,287597.0,0.0


In [416]:
fifth_row = f1000.iloc[4, :]
fifth_row

company                     UnitedHealth Group
rank                                         5
rank_change                                0.0
revenue                               287597.0
profit                                 17285.0
num. of employees                     350000.0
sector                             Health Care
city                                Minnetonka
state                                       MN
newcomer                                    no
ceo_founder                                 no
ceo_woman                                   no
profitable                                 yes
prev_rank                                  5.0
CEO                            Andrew P. Witty
Website              www.unitedhealthgroup.com
Ticker                                     UNH
Market Cap                              500468
Name: 4, dtype: object

In [417]:
# loc[row_label, column_label] returns the single value in the first row of the company column
company_value = f1000.loc[0, "company"]
company_value

'Walmart'

In [418]:
last_three_rows = f1000.iloc[-3:]
last_three_rows

Unnamed: 0,company,rank,rank_change,revenue,profit,num. of employees,sector,city,state,newcomer,ceo_founder,ceo_woman,profitable,prev_rank,CEO,Website,Ticker,Market Cap
997,Cowen,998,0.0,2112.8,295.6,1534.0,Financials,New York,NY,no,no,no,yes,,Jeffrey Solomon,https://www.cowen.com,COWN,1078.0
998,Ashland,999,0.0,2111.0,220.0,4100.0,Chemicals,Wilmington,DE,no,no,no,yes,,Guillermo Novo,https://www.ashland.com,ASH,5601.9
999,DocuSign,1000,0.0,2107.2,-70.0,7461.0,Technology,San Francisco,CA,no,no,no,no,,Allan C. Thygesen,https://www.docusign.com,DOCU,21302.8


In [419]:
five_row_seven_col = f1000.iloc[:5, :7]
five_row_seven_col

Unnamed: 0,company,rank,rank_change,revenue,profit,num. of employees,sector
0,Walmart,1,0.0,572754.0,13673.0,2300000.0,Retailing
1,Amazon,2,0.0,469822.0,33364.0,1608000.0,Retailing
2,Apple,3,0.0,365817.0,94680.0,154000.0,Technology
3,CVS Health,4,0.0,292111.0,7910.0,258000.0,Health Care
4,UnitedHealth Group,5,0.0,287597.0,17285.0,350000.0,Health Care


#### Question 2
<li>Use the Series.isnull() method to select all rows from f1000 that have a null value for the prev_rank column.</li>
<li>Select only the company, rank, and prev_rank columns where the prev_rank column is null.</li>
<li>Use the Series.notnull() method to select all rows from f1000 that have a non-null value for the prev_rank column.</li>
<li>For the previously ranked companies, subtract the rank column from the prev_rank column.</li>
<li>Assign the resulting values to the "rank_change" column of the f1000 dataframe.</li>

In [420]:
# print(list(f1000["prev_rank"]))

In [421]:
f1000["prev_rank"] = pd.to_numeric(f1000["prev_rank"], errors="coerce")
f1000["prev_rank"] = f1000["prev_rank"].astype("Int32")
f1000["prev_rank"].isnull().value_counts()

prev_rank
True     531
False    469
Name: count, dtype: int64

In [443]:
f1000[f1000["prev_rank"].isnull()][["company", "rank", "prev_rank"]]

Unnamed: 0,company,rank,prev_rank
170,Cleveland-Cliffs,171,
194,Moderna,195,
308,Devon Energy,309,
321,International Flavors & Fragrances,322,
334,Caesars Entertainment,335,
...,...,...,...
995,Vizio Holding,996,
996,1-800-Flowers.com,997,
997,Cowen,998,
998,Ashland,999,


In [423]:
f1000["prev_rank"].notnull()

0       True
1       True
2       True
3       True
4       True
       ...  
995    False
996    False
997    False
998    False
999    False
Name: prev_rank, Length: 1000, dtype: bool

In [424]:
rank_change = f1000["prev_rank"] - f1000["rank"]
f1000["rank_change"] = rank_change
f1000.sample(5)

Unnamed: 0,company,rank,rank_change,revenue,profit,num. of employees,sector,city,state,newcomer,ceo_founder,ceo_woman,profitable,prev_rank,CEO,Website,Ticker,Market Cap
467,Brighthouse Financial,468,-115.0,7142.0,-108.0,1500.0,Financials,Charlotte,NC,no,no,no,no,353.0,Eric T. Steigerwalt,https://www.brighthousefinancial.com,BHF,3958.7
556,Concentrix,557,,5587.0,405.6,290000.0,Technology,Fremont,CA,no,no,no,yes,,Christopher A. Caldwell,https://www.concentrix.com,CNXC,8605.2
745,A.O. Smith,746,,3538.9,487.1,13700.0,Industrials,Milwaukee,WI,no,no,no,yes,,Kevin J. Wheeler,https://www.aosmith.com,AOS,10032.2
503,Agilent Technologies,504,,6319.0,1210.0,17000.0,Technology,Santa Clara,CA,no,no,no,yes,,Michael R. McMullen,https://www.agilent.com,A,39714.0
269,Marriott International,270,23.0,13857.0,1099.0,120000.0,"Hotels, Restaurants & Leisure",Bethesda,MD,no,no,no,yes,293.0,Anthony G. Capuano,https://www.marriott.com,MAR,57514.9


#### Question 3
<li>Select all companies with revenue over 100,000 and negative profit from the f1000 dataframe.</li>

##### Instructions

<li>Create a boolean array that selects the companies with revenue greater than 100,000.</li>
<li>Create a boolean array that selects the companies with profit less than 0.</li>


In [446]:
f1000[(f1000["revenue"] > 100000) & (f1000["profit"] < 0)]

Unnamed: 0,company,rank,rank_change,revenue,profit,num. of employees,sector,city,state,newcomer,ceo_founder,ceo_woman,profitable,prev_rank,CEO,Website,Ticker,Market Cap
8,McKesson,9,-2,238228.0,-4539.0,67500.0,Health Care,Irving,TX,no,no,no,no,7,Brian S. Tyler,www.mckesson.com,MCK,47377


#### Question 4
<li>Select all rows for companies whose city value is either Brazil or Venezuela.</li>
<li>Select the first five companies in the Technology sector for which the city is not "Boston" from the f1000 dataframe.</li>

In [455]:
f1000[(f1000["city"] == "Brazil") | (f1000["city"] == "Venezuela")]

Unnamed: 0,company,rank,rank_change,revenue,profit,num. of employees,sector,city,state,newcomer,ceo_founder,ceo_woman,profitable,prev_rank,CEO,Website,Ticker,Market Cap


In [462]:
f1000[(f1000["sector"] == "Technology") & (f1000["city"] != "Boston")].head(5)

Unnamed: 0,company,rank,rank_change,revenue,profit,num. of employees,sector,city,state,newcomer,ceo_founder,ceo_woman,profitable,prev_rank,CEO,Website,Ticker,Market Cap
2,Apple,3,0,365817.0,94680.0,154000.0,Technology,Cupertino,CA,no,no,no,yes,3,Timothy D. Cook,www.apple.com,AAPL,2443962
7,Alphabet,8,1,257637.0,76033.0,156500.0,Technology,Mountain View,CA,no,no,no,yes,9,Sundar Pichai,https://www.abc.xyz,GOOGL,1309359
13,Microsoft,14,1,168088.0,61271.0,181000.0,Technology,Redmond,WA,no,no,no,yes,15,Satya Nadella,www.microsoft.com,MSFT,1941033
26,Meta Platforms,27,7,117929.0,39370.0,71970.0,Technology,Menlo Park,CA,no,yes,no,yes,34,Mark Zuckerberg,https://investor.fb.com,META,475718
30,Dell Technologies,31,-3,106995.0,5563.0,133000.0,Technology,Round Rock,TX,no,yes,no,yes,28,Michael S. Dell,www.delltechnologies.com,DELL,32568


In [456]:
f1000.columns

Index(['company', 'rank', 'rank_change', 'revenue', 'profit',
       'num. of employees', 'sector', 'city', 'state', 'newcomer',
       'ceo_founder', 'ceo_woman', 'profitable', 'prev_rank', 'CEO', 'Website',
       'Ticker', 'Market Cap'],
      dtype='object')

#### Sorting Values
<li>We can use the DataFrame.sort_values() method to sort the rows on a particular column.</li>
<li>To do so, we pass the column name to the method:</li>
<code>
sorted_rows = df.sort_values("column_name")
</code>
<li>By default, the sort_values() method will sort the rows in ascending order — from smallest to largest.</li>
<li>To sort the rows in descending order instead, we can set the ascending parameter to False:</li>
<code>
    sorted_rows = df.sort_values("column_name", ascending=False)
</code>
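<li>A short sketch of both sort orders on a hypothetical dataframe (toy_df and its values are made up for illustration). Note that sort_values() returns a new dataframe and keeps the original index labels, so the first row of the sorted result is reached by position with iloc[0].</li>

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({
    "company": ["A", "B", "C"],
    "employees": [500, 1200, 800],
})

# Ascending order is the default
ascending_rows = toy_df.sort_values("employees")

# Descending order: largest value first
descending_rows = toy_df.sort_values("employees", ascending=False)

# Index labels travel with their rows, so use position (iloc) for "first row"
largest = descending_rows.iloc[0]
```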


In [465]:
f1000.columns

Index(['company', 'rank', 'rank_change', 'revenue', 'profit',
       'num. of employees', 'sector', 'city', 'state', 'newcomer',
       'ceo_founder', 'ceo_woman', 'profitable', 'prev_rank', 'CEO', 'Website',
       'Ticker', 'Market Cap'],
      dtype='object')

#### Question
<li>Read 'Fortune_1000.csv' using pandas read_csv() method.</li>
<li>Find the company headquartered in Los Angeles with the largest number of employees.</li>
<li>Select only the rows that have a city name equal to Los Angeles.</li>
<li>Use DataFrame.sort_values() to sort those rows by the employees column in descending order.</li>
<li>Use DataFrame.iloc[] to select the first row from the sorted dataframe.</li>


In [482]:
f1000[(f1000["city"] == "Los Angeles") & (f1000["num. of employees"] == f1000[f1000["city"] == "Los Angeles"]["num. of employees"].max())]
# l_a[l_a["num. of employees"] == l_a["num. of employees"].max()]


Unnamed: 0,company,rank,rank_change,revenue,profit,num. of employees,sector,city,state,newcomer,ceo_founder,ceo_woman,profitable,prev_rank,CEO,Website,Ticker,Market Cap
260,Reliance Steel & Aluminum,261,82,14093.3,1413.0,13950.0,Materials,Los Angeles,CA,no,no,no,yes,343,James D. Hoffman,https://www.rsac.com,RS,11313.5


In [484]:
los_angeles = f1000[f1000["city"] == "Los Angeles"]
los_angeles

Unnamed: 0,company,rank,rank_change,revenue,profit,num. of employees,sector,city,state,newcomer,ceo_founder,ceo_woman,profitable,prev_rank,CEO,Website,Ticker,Market Cap
260,Reliance Steel & Aluminum,261,82.0,14093.3,1413.0,13950.0,Materials,Los Angeles,CA,no,no,no,yes,343.0,James D. Hoffman,https://www.rsac.com,RS,11313.5
542,KB Home,543,,5724.9,564.7,2244.0,Engineering & Construction,Los Angeles,CA,no,no,no,yes,,Jeffrey T. Mezger,https://www.kbhome.com,KBH,2858.1
626,Ares Management,627,,4770.6,408.8,2100.0,Financials,Los Angeles,CA,no,yes,no,yes,,Michael Arougheti,https://www.aresmgmt.com,ARES,18940.0
692,Mercury General,693,,3993.4,247.9,4300.0,Financials,Los Angeles,CA,no,no,no,yes,,Gabriel Tirador,https://www.mercuryinsurance.com,MCY,3045.4
910,Guess,911,,2591.6,171.4,12500.0,Retailing,Los Angeles,CA,no,no,no,yes,,Carlos E. Alberini,https://www.guess.com,GES,1305.5


In [494]:
# assign the sorted result instead of using inplace=True on a filtered slice
los_angeles_sorted = los_angeles.sort_values("num. of employees", ascending=False)
# iloc is position-based, so the largest employer is at position 0
los_angeles_sorted.iloc[0]["company"]

'Reliance Steel & Aluminum'

### String Manipulation In Pandas DataFrame

<li>String manipulation is the process of changing, parsing, splitting, 'cleaning' or analyzing strings.</li>
<li>Raw string data is often not in a form that is suitable for analysis or for describing the data.</li>
<li>Python is known for its ability to manipulate strings.</li>
<li>Pandas provides built-in functions, most of them available through the .str accessor of a series, to modify and process the string data in a DataFrame.</li>
<li>Some of the most useful pandas string processing functions are as follows:</li>
<ol>
    <li><b>lower()</b></li>
    <li><b>upper()</b></li>
    <li><b>islower()</b></li>
    <li><b>isupper()</b></li>
    <li><b>isnumeric()</b></li>
    <li><b>strip()</b></li>
    <li><b>split()</b></li>
    <li><b>len()</b></li>
    <li><b>get_dummies()</b></li>
    <li><b>startswith()</b></li>
    <li><b>endswith()</b></li>
    <li><b>replace()</b></li>
    <li><b>contains()</b></li>
</ol>


#### 1. lower(): 
<li>It converts all uppercase characters in strings in the dataframe to lower case and returns the lowercase strings in the result.</li>


In [425]:
weather_df.Event = weather_df.Event.str.lower()
weather_df

Unnamed: 0,Day,Time,Temperature,Windspeed,Event,Holiday,Demo
0,1/1/2017,3.15,32.0,6.0,rain,True,l
1,1/4/2017,5.0,22.0,9.0,sunny,True,k
2,1/5/2017,8.0,-1.0,,snow,False,d
3,1/6/2017,5.0,,7.0,,False,f
4,1/7/2017,8.0,32.0,,rain,False,l
5,1/8/2017,5.0,,,sunny,True,f
6,1/9/2017,8.0,,,,True,j
7,1/10/2017,5.0,34.0,8.0,cloudy,True,d
8,1/11/2017,9.0,-4.0,-1.0,snow,False,d
9,1/12/2017,5.0,26.0,12.0,sunny,False,d


#### 2. upper():
<li>It converts all lowercase characters in strings in the dataframe to upper case and returns the uppercase strings in result.</li>


In [426]:
weather_df.Event = weather_df.Event.str.upper()
weather_df

Unnamed: 0,Day,Time,Temperature,Windspeed,Event,Holiday,Demo
0,1/1/2017,3.15,32.0,6.0,RAIN,True,l
1,1/4/2017,5.0,22.0,9.0,SUNNY,True,k
2,1/5/2017,8.0,-1.0,,SNOW,False,d
3,1/6/2017,5.0,,7.0,,False,f
4,1/7/2017,8.0,32.0,,RAIN,False,l
5,1/8/2017,5.0,,,SUNNY,True,f
6,1/9/2017,8.0,,,,True,j
7,1/10/2017,5.0,34.0,8.0,CLOUDY,True,d
8,1/11/2017,9.0,-4.0,-1.0,SNOW,False,d
9,1/12/2017,5.0,26.0,12.0,SUNNY,False,d


#### 3. islower(): 
<li>It checks whether all characters in each string in the DataFrame are in lower case, and returns a Boolean value for each row (NaN for missing values).</li>


In [427]:
weather_df.Event.str.islower()

0     False
1     False
2     False
3       NaN
4     False
5     False
6       NaN
7     False
8     False
9     False
10    False
11    False
12    False
Name: Event, dtype: object

#### 4. isupper(): 
<li>It checks whether all characters in each string in the DataFrame are in upper case, and returns a Boolean value for each row (NaN for missing values).</li>


In [428]:
weather_df.Event.str.isupper()

0     True
1     True
2     True
3      NaN
4     True
5     True
6      NaN
7     True
8     True
9     True
10    True
11    True
12    True
Name: Event, dtype: object

#### 5. isnumeric():
<li>It checks whether all characters in each string in the Data-Frame are numeric or not, and returns a Boolean value.</li>


In [429]:
df.temperature.str.isnumeric()

0      True
1     False
2     False
3     False
4      True
5     False
6     False
7      True
8     False
9      True
10     True
11    False
12     True
Name: temperature, dtype: bool

#### 6. strip():
<li>If there are spaces at the beginning or end of a string, we should trim them using the strip() method.</li>
<li>It removes the leading and trailing whitespace from each string in a DataFrame column.</li>


In [430]:
weather_df.Event.str.strip()

0       RAIN
1      SUNNY
2       SNOW
3        NaN
4       RAIN
5      SUNNY
6        NaN
7     CLOUDY
8       SNOW
9      SUNNY
10     RAINY
11      SNOW
12     SUNNY
Name: Event, dtype: object

#### 7. split(' '):
<li>It splits each string around the given separator (whitespace by default).</li>
<li>The pieces produced by the split are stored in a list for each row.</li>


In [431]:
schedule_df.description.str.split().head()

0    [JointService, Military, Honor, Guard]
1    [JointService, Military, Honor, Guard]
2    [JointService, Military, Honor, Guard]
3    [JointService, Military, Honor, Guard]
4    [JointService, Military, Honor, Guard]
Name: description, dtype: object

#### 8. len():
<li>With the help of len() we can compute the length of each string in DataFrame.</li>
<li>If there is empty data in a DataFrame, it returns NaN.</li>


In [432]:
weather_df.Event.str.len()

0     4.0
1     5.0
2     4.0
3     NaN
4     4.0
5     5.0
6     NaN
7     6.0
8     4.0
9     5.0
10    5.0
11    4.0
12    5.0
Name: Event, dtype: float64

#### 9. get_dummies(): 
<li>It returns a DataFrame with one-hot encoded values: each category gets its own column, which holds True (1) for the rows where that category occurs and False (0) otherwise.</li>


In [433]:
pd.get_dummies(weather_df)

Unnamed: 0,Time,Temperature,Windspeed,Holiday,Day_1/1/2017,Day_1/10/2017,Day_1/11/2017,Day_1/12/2017,Day_1/13/2017,Day_1/14/2017,...,Event_CLOUDY,Event_RAIN,Event_RAINY,Event_SNOW,Event_SUNNY,Demo_d,Demo_f,Demo_j,Demo_k,Demo_l
0,3.15,32.0,6.0,True,True,False,False,False,False,False,...,False,True,False,False,False,False,False,False,False,True
1,5.0,22.0,9.0,True,False,False,False,False,False,False,...,False,False,False,False,True,False,False,False,True,False
2,8.0,-1.0,,False,False,False,False,False,False,False,...,False,False,False,True,False,True,False,False,False,False
3,5.0,,7.0,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
4,8.0,32.0,,False,False,False,False,False,False,False,...,False,True,False,False,False,False,False,False,False,True
5,5.0,,,True,False,False,False,False,False,False,...,False,False,False,False,True,False,True,False,False,False
6,8.0,,,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
7,5.0,34.0,8.0,True,False,True,False,False,False,False,...,True,False,False,False,False,True,False,False,False,False
8,9.0,-4.0,-1.0,False,False,False,True,False,False,False,...,False,False,False,True,False,True,False,False,False,False
9,5.0,26.0,12.0,False,False,False,False,True,False,False,...,False,False,False,False,True,True,False,False,False,False


In [434]:
data_f = pd.DataFrame([["Rajendra", 23, "Bhaktapur"], ["Mohan", 22, "Kathmandu"]])
pd.get_dummies(data_f)

Unnamed: 0,1,0_Mohan,0_Rajendra,2_Bhaktapur,2_Kathmandu
0,23,False,True,True,False
1,22,True,False,False,True


In [435]:
pd.get_dummies(weather_df["Event"]) # one hot encoding

Unnamed: 0,CLOUDY,RAIN,RAINY,SNOW,SUNNY
0,False,True,False,False,False
1,False,False,False,False,True
2,False,False,False,True,False
3,False,False,False,False,False
4,False,True,False,False,False
5,False,False,False,False,True
6,False,False,False,False,False
7,True,False,False,False,False
8,False,False,False,True,False
9,False,False,False,False,True


#### 10. startswith(pattern):
<li>It returns True if the string in a series or Index starts with the pattern.</li>
<li>If you want to keep only the rows that start with 'ind', you can specify df[df[col].str.startswith('ind')]</li>


In [436]:
weather_df.Event.str.startswith("SU")

0     False
1      True
2     False
3       NaN
4     False
5      True
6       NaN
7     False
8     False
9      True
10    False
11    False
12     True
Name: Event, dtype: object

#### 11. endswith(pattern):
<li>It returns True if the string in a series or Index ends with the pattern.</li>
<li>If you want to keep only the rows that end with 'es', you can specify df[df[col].str.endswith('es')]</li>


In [437]:
weather_df.Event.str.endswith("NNY")

0     False
1      True
2     False
3       NaN
4     False
5      True
6       NaN
7     False
8     False
9      True
10    False
11    False
12     True
Name: Event, dtype: object
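
<li>The outputs of startswith() and endswith() above contain NaN for the missing events, and a mask with NaN cannot be used directly to filter rows. A possible sketch, assuming we want missing values to count as "no match": pass na=False so the mask is purely boolean. The dataframe toy_df is hypothetical.</li>

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing event
toy_df = pd.DataFrame({"Event": ["SUNNY", np.nan, "SNOW", "RAIN"]})

# na=False turns missing values into False instead of NaN
starts_with_s = toy_df["Event"].str.startswith("S", na=False)
ends_with_ny = toy_df["Event"].str.endswith("NY", na=False)

# The clean boolean mask can now be used to filter rows
s_rows = toy_df[starts_with_s]
```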

#### 12. replace(a,b):
<li>It replaces the value a with the value b.</li>
<li>If you wanted to remove white space characters then you can use replace() method as:</li>
<code>
df[col_name].str.replace(" ", "")
</code>


In [438]:
weather_df["Demo"] = weather_df.Demo.str.replace("d", "Replace")
weather_df

Unnamed: 0,Day,Time,Temperature,Windspeed,Event,Holiday,Demo
0,1/1/2017,3.15,32.0,6.0,RAIN,True,l
1,1/4/2017,5.0,22.0,9.0,SUNNY,True,k
2,1/5/2017,8.0,-1.0,,SNOW,False,Replace
3,1/6/2017,5.0,,7.0,,False,f
4,1/7/2017,8.0,32.0,,RAIN,False,l
5,1/8/2017,5.0,,,SUNNY,True,f
6,1/9/2017,8.0,,,,True,j
7,1/10/2017,5.0,34.0,8.0,CLOUDY,True,Replace
8,1/11/2017,9.0,-4.0,-1.0,SNOW,False,Replace
9,1/12/2017,5.0,26.0,12.0,SUNNY,False,Replace


#### 13. contains():
<li>contains() method checks whether the string contains a particular substring or not.</li>
<li>The function is quite similar to replace() but instead of replacing the string itself it just returns the boolean value True or False.</li>
<li>If a substring is present in a string, then it returns boolean value True else False.</li>



In [439]:
weather_df.Event.str.contains("UNN")

0     False
1      True
2     False
3       NaN
4     False
5      True
6       NaN
7     False
8     False
9      True
10    False
11    False
12     True
Name: Event, dtype: object

In [440]:
weather_df.Event.str.contains("LOUD")

0     False
1     False
2     False
3       NaN
4     False
5     False
6       NaN
7      True
8     False
9     False
10    False
11    False
12    False
Name: Event, dtype: object

#### Handling Missing Values
<li>We can use fillna() method in pandas to fill missing values using different ways.</li>
<li>We can use interpolation method to make a guess on missing values.</li>
<li>We can use dropna() method to drop rows with missing values.</li>
<li>We can also fill missing values with the mean value, median value or the mode value depending on the values of columns.</li>
<li>Filling missing values with mean and median is appropriate when the column has continuous values.</li>
<li>If the data is categorical then filling missing values with mode is a good idea.</li>
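<li>A minimal sketch of filling with the mean and the mode, assuming a continuous column and a categorical column. The dataframe toy_df and its values are hypothetical.</li>

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one continuous and one categorical column
toy_df = pd.DataFrame({
    "Temperature": [30.0, np.nan, 20.0, 10.0],
    "Event": ["Sunny", "Rain", None, "Sunny"],
})

# Continuous column: fill with the mean (missing values are skipped by mean())
mean_temp = toy_df["Temperature"].mean()
toy_df["Temperature"] = toy_df["Temperature"].fillna(mean_temp)

# Categorical column: fill with the most frequent value
# mode() returns a Series because there can be ties, so take the first entry
most_common_event = toy_df["Event"].mode()[0]
toy_df["Event"] = toy_df["Event"].fillna(most_common_event)
```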

#### fillna(method = 'ffill')
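<li>Forward fill copies the last valid value forward into the missing positions that follow it. The sketch below uses the ffill() method, which does the same job as fillna(method='ffill'); recent pandas releases deprecate the method argument of fillna(), so ffill() is the safer spelling. The series temps is hypothetical.</li>

```python
import numpy as np
import pandas as pd

# Hypothetical temperature readings with gaps
temps = pd.Series([32.0, np.nan, np.nan, 28.0, np.nan])

# Each NaN takes the most recent valid value before it
forward_filled = temps.ffill()
```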

#### fillna(method = 'bfill')
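<li>Backward fill copies the next valid value backward into the missing positions before it. The sketch below uses the bfill() method, which does the same job as fillna(method='bfill'). The series temps is hypothetical; note that a missing value at the very end stays missing because nothing follows it.</li>

```python
import numpy as np
import pandas as pd

# Hypothetical temperature readings with gaps
temps = pd.Series([32.0, np.nan, np.nan, 28.0, np.nan])

# Each NaN takes the next valid value after it; the trailing NaN has none
backward_filled = temps.bfill()
```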

#### Interpolate(Linear Interpolation)
<li>method = time</li>
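<li>A sketch comparing the default linear interpolation with method="time", which needs a datetime index. The dates and values below are hypothetical. Linear interpolation treats the rows as equally spaced, while the time method weights the guess by the actual gap between dates.</li>

```python
import numpy as np
import pandas as pd

# Hypothetical readings: Jan 2 is missing, and Jan 3 is absent from the index
temps = pd.Series(
    [10.0, np.nan, 40.0],
    index=pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-04"]),
)

# Default (linear): ignores the index, so the gap is filled with the midpoint
linear = temps.interpolate()

# Time-based: Jan 2 is one third of the way from Jan 1 to Jan 4
by_time = temps.interpolate(method="time")
```

<li>Here the linear result for the missing day is the midpoint of 10 and 40, while the time-based result lies closer to 10 because Jan 2 is nearer to Jan 1 than to Jan 4.</li>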

#### dropna()
<li>dropna() with the how and thresh parameters</li>
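<li>A small sketch of the three common ways to call dropna(). The dataframe toy_df is hypothetical. The parameter that sets the minimum number of non-null values is named thresh.</li>

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 0 is complete, row 2 is entirely empty
toy_df = pd.DataFrame({
    "Temperature": [32.0, np.nan, np.nan, 28.0],
    "Windspeed": [6.0, 9.0, np.nan, np.nan],
    "Event": ["Rain", "Sunny", None, None],
})

# Default (how="any"): drop a row if any value is missing
any_dropped = toy_df.dropna()

# how="all": drop a row only if every value is missing
all_dropped = toy_df.dropna(how="all")

# thresh=2: keep rows that have at least 2 non-null values
thresh_dropped = toy_df.dropna(thresh=2)
```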

#### Handle Missing Values using .replace() method
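<li>Some files mark missing data with special placeholder numbers instead of leaving the cell empty. A minimal sketch, assuming the placeholders are -99999 and -88888 (both hypothetical, as is toy_df): replace() can turn them into NaN so that the other missing-value tools work.</li>

```python
import numpy as np
import pandas as pd

# Hypothetical toy data where -99999 and -88888 stand for "missing"
toy_df = pd.DataFrame({
    "Temperature": [32, -99999, 28],
    "Windspeed": [6, 9, -88888],
})

# Replace every occurrence of either placeholder with NaN
cleaned = toy_df.replace([-99999, -88888], np.nan)
```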

#### Replacing Values Using a Dictionary (using columns and without using columns)
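<li>A sketch of the two dictionary forms, using a hypothetical toy_df. With columns, the dictionary maps a column name to the value to look for in that column. Without columns, the dictionary maps an old value to its new value everywhere in the dataframe.</li>

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with placeholder values
toy_df = pd.DataFrame({
    "Temperature": [32, -99999, 28],
    "Windspeed": [6, -99999, -88888],
})

# Using columns: {column_name: value_to_replace}, all replaced by NaN
per_column = toy_df.replace({"Temperature": -99999, "Windspeed": -88888}, np.nan)

# Without columns: {old_value: new_value}, applied to the whole dataframe
value_map = toy_df.replace({-99999: np.nan, -88888: 0})
```

<li>In per_column the -99999 in Windspeed is left alone, because the dictionary only asked for -88888 to be replaced in that column.</li>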

#### Replacing values using a regex
<code>
df.replace(original_value, replaced_value, regex = True)
</code>
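<li>A possible use of regex replacement: stripping unit text that was typed into number columns. The dataframe toy_df, its units and the pattern are all hypothetical choices for illustration.</li>

```python
import pandas as pd

# Hypothetical toy data where units are mixed into the values
toy_df = pd.DataFrame({
    "Temperature": ["32 F", "28 F"],
    "Windspeed": ["6 mph", "9 mph"],
})

# Pattern: optional whitespace followed by letters at the end of the string
no_units = toy_df.replace(r"\s*[A-Za-z]+$", "", regex=True)
```

<li>The values are still strings after the replacement; they can be converted with pd.to_numeric() if numbers are needed.</li>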


#### Mapping values of a particular column using replace method
<li>Replacing the list of values using another list of values</li>
<li>Replacing values of a particular column using a dictionary</li>
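<li>A sketch of both mapping styles on a hypothetical toy_df. With two lists, each value in the first list is replaced by the value at the same position in the second list. With a dictionary, each key is replaced by its value.</li>

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({
    "student": ["Ram", "Shyam", "Hari"],
    "score": ["poor", "average", "good"],
})

# List to list: "poor" -> 1, "average" -> 2, "good" -> 3
as_numbers = toy_df.replace(["poor", "average", "good"], [1, 2, 3])

# Dictionary applied to a single column
score_level = toy_df["score"].replace({"poor": "low", "average": "mid", "good": "high"})
```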

#### GroupBy Functions
<li>Pandas groupby is used for grouping the data according to the categories and apply a function to the categories.</li>
<li>It also helps to aggregate data efficiently.</li>
<li>Pandas dataframe.groupby() function is used to split the data into groups based on some criteria.</li>
<code>
    df.groupby(col_name, as_index, sort, dropna)
</code>
<li>It uses split, apply, combine principle to create a groupby dataframe.</li>
<li>The groupby function accepts multiple parameters. Some of them are as follows:</li>
<ol>
    <li>col_name(required): the name of the column by which you want to group the elements.</li>
    <li>as_index(optional): default = True, keep it as True if you want the groupby column to become the index, else set it to False.</li>
    <li>sort(optional): default = True, keep it as True if you want to sort the groups by their keys, else set it to False.</li>
    <li>dropna(optional): default = True, if you set it to False then NaN values are also included as a separate group.</li>
</ol>
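<li>A short sketch of how as_index and dropna change the result, using an illustrative frame with one missing group key.</li>

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "fuel": ["Petrol", "Diesel", "Petrol", np.nan, "Diesel"],
    "price": [100, 200, 300, 400, 600],
})

# Default: group keys become the index; the row with a NaN key is dropped.
by_index = df.groupby("fuel")["price"].mean()

# as_index=False: group keys stay as a regular column.
flat = df.groupby("fuel", as_index=False)["price"].mean()

# dropna=False: NaN keys form their own group.
with_nan = df.groupby("fuel", dropna=False)["price"].mean()
```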

### GroupBy Aggregation Functions
<li>Here are some of the aggregation functions available in Pandas, with a quick summary of what each one does.</li>
<ol>
    <li>mean(): Compute mean of groups for numeric columns</li>
    <li>sum(): Compute sum of group values for numeric columns</li>
    <li>size(): Compute group sizes</li>
    <li>count(): Compute the count of non-missing values in each group</li>
    <li>std(): Standard deviation of groups for numeric columns</li>
    <li>var(): Compute variance of groups for numeric columns</li>
    <li>describe(): Generates descriptive statistics</li>
    <li>first(): Compute the first non-missing value in each group</li>
    <li>last(): Compute the last non-missing value in each group</li>
    <li>nth(): Take the nth row from each group, or a subset of rows if n is a list</li>
    <li>min(): Compute min of group values</li>
    <li>max(): Compute max of group values</li>
</ol>
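<li>A sketch of a few of these functions on illustrative data. One price is missing on purpose to show how size() and count() differ, and agg() is used to compute several statistics in one call.</li>

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "fuel": ["Petrol", "Diesel", "Petrol", "Diesel", "Petrol"],
    "price": [100.0, 200.0, np.nan, 400.0, 300.0],
    "km_driven": [10, 20, 30, 40, 50],
})

grouped = df.groupby("fuel")

sizes = grouped.size()             # rows per group, missing values included
counts = grouped["price"].count()  # non-missing prices per group

# Named aggregation: new_column=(source_column, function).
summary = grouped.agg(
    max_price=("price", "max"),
    mean_km=("km_driven", "mean"),
)
```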

#### Question
<li>Read the 'car_details.csv' file and create a pandas dataframe from it.</li>
<li>Find the maximum price for each car brand.</li>
<li>Find the average price for each fuel type.</li>
<li>Find the average km_driven for each seller_type.</li>
<li>Find the count of each car name.</li>
<li>Find the maximum km_driven for each owner type.</li>

####  Concatenating DataFrames
<li>The pandas.concat() function does all the heavy lifting of concatenating objects along an axis.</li>
<li>To join two individual dataframes into one combined dataframe, we can use the concatenation operation.</li>
<li>We can concatenate along the rows (axis = 0) as well as along the columns (axis = 1).</li>

**syntax**

<code>
    pd.concat([df1,df2], axis, keys, ignore_index)
</code>

<li>df1 and df2 (required) are the two dataframes we want to concatenate; the list can hold more than two.</li>
<li>axis: the axis to concatenate along; possible values are 0 (along the rows) and 1 (along the columns); default = 0.</li>
<li>keys: a sequence of labels, one per dataframe, added as an outer index level to identify where each part came from; default = None</li>
<li>ignore_index: if True, discard the index values along the concatenation axis and number the result 0, 1, 2, ...; default = False</li>

#### Concatenating Dataframes along the rows
![](images/concat_rows.png)
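<li>Row-wise concatenation as pictured above can be sketched as follows, with illustrative city data. The sketch also shows the effect of ignore_index and keys.</li>

```python
import pandas as pd

india = pd.DataFrame({"city": ["Mumbai", "Delhi"], "temperature": [32, 45]})
us = pd.DataFrame({"city": ["New York", "Chicago"], "temperature": [21, 14]})

# Default: rows are stacked and each frame keeps its own index labels.
stacked = pd.concat([india, us])

# ignore_index=True: renumber the combined rows from 0.
renumbered = pd.concat([india, us], ignore_index=True)

# keys: add an outer index level naming the source of each row.
keyed = pd.concat([india, us], keys=["india", "us"])
us_rows = keyed.loc["us"]
```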

#### Concatenating DataFrames along columns
![](images/concat_cols.png)
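<li>Column-wise concatenation places the frames side by side and aligns rows on the index labels, not on position. A minimal sketch with illustrative data, where the second frame lists the cities in a different order:</li>

```python
import pandas as pd

temps = pd.DataFrame({"temperature": [32, 45]}, index=["Mumbai", "Delhi"])
winds = pd.DataFrame({"windspeed": [12, 7]}, index=["Delhi", "Mumbai"])

# axis=1: add the columns of winds next to those of temps, matching on index.
side_by_side = pd.concat([temps, winds], axis=1)
```

<li>Because rows are matched by label, Mumbai's windspeed ends up next to Mumbai's temperature even though the input order differs.</li>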

#### Merge
<li>Pandas has full-featured, high-performance in-memory join operations that are idiomatically very similar to relational databases like SQL.</li>
<li>Pandas provides a single function, merge, as the entry point for all standard database join operations between DataFrame objects.</li>
<li>The <b>merge()</b> method combines two DataFrames on one or more key columns (or indexes) and returns a new DataFrame; the inputs are not modified.</li>
<li>Parameters such as on, left_on, right_on and suffixes control which columns are matched and how overlapping column names are labelled.</li>
<li>We can specify the type of join with the how parameter of the merge method.</li>
<li>The four most common join types are:</li>
<ol>
    <li><b>Inner join</b></li>
    <li><b>Left join</b></li>
    <li><b>Right join</b></li>
    <li><b>Outer join</b></li>
</ol>

#### 1. Inner Join
![](images/inner_join.png)

#### 2. Left Join

![](images/left_join.png)

#### 3. Right Join

![](images/right_join.png)

#### 4. Outer Join

![](images/outer_join.png)
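<li>The four joins pictured above can be compared side by side on two small illustrative frames that share a city column. Only New York and Chicago appear in both.</li>

```python
import pandas as pd

temps = pd.DataFrame({
    "city": ["New York", "Chicago", "Orlando"],
    "temperature": [21, 14, 35],
})
humidity = pd.DataFrame({
    "city": ["Chicago", "New York", "San Diego"],
    "humidity": [65, 68, 71],
})

# Inner: only cities present in both frames.
inner = pd.merge(temps, humidity, on="city", how="inner")

# Left: every city from temps; humidity is NaN where there is no match.
left = pd.merge(temps, humidity, on="city", how="left")

# Right: every city from humidity; temperature is NaN where there is no match.
right = pd.merge(temps, humidity, on="city", how="right")

# Outer: every city from either frame.
outer = pd.merge(temps, humidity, on="city", how="outer")
```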

#### Crosstab 

<li>Cross tabulation is used to quantitatively analyze the relationship between multiple variables.</li>
<li>Cross tabulations are also referred to as contingency tables or crosstabs.</li>
<li>They group variables together and enable researchers to understand the relationship between different variables.</li>
<li>We often come across the crosstab() method in pandas when doing multivariate analysis.</li>

**Syntax**

<code>
    pd.crosstab(index, columns, values, margins, margins_name, normalize, aggfunc, dropna)
</code>
<ol>
    <li>index : array-like, Series, or list of arrays/Series, Values to group by in the rows.</li>
    <li>columns : array-like, Series, or list of arrays/Series, Values to group by in the columns.</li>
    <li>values : array-like, optional, Array of values to aggregate according to the factors. Requires `aggfunc` to be specified.</li>
    <li>aggfunc : function, optional, If specified, requires `values` to be specified as well.</li>
    <li>margins : bool, default False, Add row/column margins (subtotals).</li>
    <li>margins_name : str, default 'All', Name of the row/column that will contain the totals when margins is True.</li>
    <li>dropna : bool, default True, Do not include columns whose entries are all NaN.</li>
    <li>normalize : bool or one of 'all', 'index', 'columns', default False, Normalize by dividing all values by the sum of values.
    <ol>
        <li>If passed 'all' or True, will normalize over all values.</li>
        <li>If passed 'index' will normalize over each row.</li>
        <li>If passed 'columns' will normalize over each column.</li>
        <li>If margins is True, will also normalize margin values.</li>
    </ol>
    </li>
</ol>
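<li>A sketch of the main crosstab() options on a small illustrative survey. The column names and values are made up for the example.</li>

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["M", "F", "M", "F", "M", "F"],
    "handedness": ["Right", "Left", "Right", "Right", "Left", "Right"],
    "age": [30, 40, 20, 25, 35, 45],
})

# Frequency table: rows are genders, columns are handedness values.
counts = pd.crosstab(df["gender"], df["handedness"])

# margins=True adds an "All" row and column holding the totals.
with_totals = pd.crosstab(df["gender"], df["handedness"], margins=True)

# normalize="index" turns each row into proportions that sum to 1.
row_share = pd.crosstab(df["gender"], df["handedness"], normalize="index")

# values + aggfunc aggregate a third column instead of counting rows.
mean_age = pd.crosstab(
    df["gender"], df["handedness"], values=df["age"], aggfunc="mean"
)
```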

#### Pivot
<li>The pivot() method reshapes a DataFrame using three of its columns: the unique values of index become the row labels, the unique values of columns become the column labels, and the cells are filled from values.</li>

    
**syntax**
<code>
df.pivot(index, columns, values)
</code>
    
<b>Parameters:</b>
<ol>
    <li>index : column name(s) whose values become the new frame's index</li>
    <li>columns : column name(s) whose values become the new frame's columns</li>
    <li>values : column name(s) whose values populate the new frame</li>
</ol>

**Returns: Reshaped DataFrame**

**Exception: ValueError raised if any index/columns pair appears more than once.**
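<li>A minimal sketch of pivot() on illustrative long-format weather data, followed by the duplicate case that raises ValueError.</li>

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2024-05-01", "2024-05-01", "2024-05-02", "2024-05-02"],
    "city": ["Mumbai", "Delhi", "Mumbai", "Delhi"],
    "temperature": [32, 45, 33, 44],
})

# One row per date, one column per city, cells filled with temperature.
wide = df.pivot(index="date", columns="city", values="temperature")

# Repeat the first row so one (date, city) pair occurs twice.
with_duplicate = pd.concat([df, df.iloc[[0]]], ignore_index=True)

try:
    with_duplicate.pivot(index="date", columns="city", values="temperature")
    duplicate_rejected = False
except ValueError:
    # pivot() cannot decide which of the two values belongs in the cell.
    duplicate_rejected = True
```

<li>When duplicates are expected, pivot_table() with an aggregation function is the usual alternative because it combines the repeated values instead of failing.</li>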