## Pandas DataFrames

A DataFrame is one of the most widely used data structures in data analysis. It is __a table with rows and columns, where each row has an index label and each column has a meaningful name__. There are various ways of creating dataframes, for instance, building them from dictionaries or reading them from .txt and .csv files. 

First, let us import all the necessary libraries.

In [1]:
import numpy as np
import pandas as pd

Now, let us move ahead, and see how to create dataframes.

### Creating DataFrames

There are multiple ways of creating a DataFrame in Pandas. But, for that, we need to have some data first. This data can be in the form of an external data file (txt, csv or xlsx), or it can already be present in memory as a Python data structure. 

To convert the data into a DataFrame, we use the `DataFrame()` constructor in Pandas. It takes a data structure, preferably a dictionary, as its argument. It has the following syntax - 

__`identifier = pd.DataFrame(data)`__
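
As a minimal sketch of that call, the block below builds the same small table from two different in-memory structures. The names `records`, `from_records` and `from_columns`, and the sample values, are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
records = [
    {"name": "A", "score": 1},
    {"name": "B", "score": 2},
]

# A list of dictionaries: each dictionary becomes one row.
from_records = pd.DataFrame(records)

# A dictionary of lists: each key becomes one column.
# The optional `index` argument supplies custom row labels.
from_columns = pd.DataFrame(
    {"name": ["A", "B"], "score": [1, 2]},
    index=["first", "second"],
)

print(from_records)
print(from_columns)
```

Both frames hold the same values; only the row labels differ, because `from_records` falls back to the default integer index.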

Let us look at some examples.

In [2]:
details={'First Name':['Elon','Jeff','Albert','Ratan','Sam'],
         'Last Name':['Musk','Bezos','Einstein','Tata','Tull'],
         'Age':[50,53,76,83,25],
         'Company Name':['SpaceX','Blue Origin','Tesla','Tata Group','Norwack Steel']
        }

df1=pd.DataFrame(details)
df1

,First Name,Last Name,Age,Company Name
0,Elon,Musk,50,SpaceX
1,Jeff,Bezos,53,Blue Origin
2,Albert,Einstein,76,Tesla
3,Ratan,Tata,83,Tata Group
4,Sam,Tull,25,Norwack Steel


This is how a DataFrame is created using a dictionary. Dictionaries are a convenient way to create dataframes from data that is already in memory. Let us now look at some other functions that create dataframes - 

- __`pd.read_csv(data)`__ - This function reads a table from a csv file, given as a local path or a URL.
- __`pd.read_excel(data)`__ - This function reads a table from an Excel file.

Let us see them in action.

In [4]:
topics=pd.read_excel('https://github.com/yashj1301/Python3-UpGrad-UMich/blob/master/Python%203.x/Upgrad/Modules/Module%203%20-%20Python%20for%20Data%20Science/Session%203%20-%20Pandas/Data/Topics.xlsx?raw=true')
topics

,S.No.,Name of Topic,Subject,Book Referred
0,1,Probability Theory,Data Analysis and Interpretations,Probability and Statistics for Engineers and S...
1,2,Random Variables,Data Analysis and Interpretations,Probability and Statistics for Engineers and S...
2,3,Discrete and Continuous Probability Distributions,Data Analysis and Interpretations,Probability and Statistics for Engineers and S...
3,4,Normal Distribution,Data Analysis and Interpretations,Probability and Statistics for Engineers and S...
4,5,Descriptive Statistics and Sampling,Data Analysis and Interpretations,Probability and Statistics for Engineers and S...
...,...,...,...,...
337,338,Capacitive Sensing and Resonant Drive Circuits,Microsystems - Analysis and Design,Electromechanics and MEMS by T Jones and N Nen...
338,339,Distributed 1-D and 2-D Capacitive Electromech...,Microsystems - Analysis and Design,Electromechanics and MEMS by T Jones and N Nen...
339,340,Practical MEMS Devices,Microsystems - Analysis and Design,Electromechanics and MEMS by T Jones and N Nen...
340,341,Electromechanics of Piezoelectric Elements,Microsystems - Analysis and Design,Electromechanics and MEMS by T Jones and N Nen...


In [5]:
schedule=pd.read_csv('https://github.com/yashj1301/Python3-UpGrad-UMich/raw/master/Python%203.x/Upgrad/Modules/Module%203%20-%20Python%20for%20Data%20Science/Session%203%20-%20Pandas/Data/schedule.csv')
schedule

,Index,Name of the Course,Name of the Certification,Total Modules,Modules Completed,Modules remaining,Due date,Status
0,1,Interactivity with Javascript,Web Design for Everybody,4,1,3,11-09-2022,0%
1,2,Data Science Methodology,IBM Data Science,3,1,2,11-09-2022,0%
2,3,Ask Questions to Make Data Driven Decisions,Google Data Analytics,4,2,2,11-09-2022,50%
3,4,Python Data Structures,Python Programming for Everybody,7,0,7,11-09-2022,0%
4,5,UpGrad Data Science Prep Content,Upgrad DS+AI/ML,6,3,3,31-07-2022,50%
5,6,UpGrad Data Science Course,Upgrad DS+AI/ML,11,8,3,11-09-2022,73%


Here, you can see that the columns meant to be our indices on the two dataframes - `S.No.` and `Index` - were read as ordinary columns, and Pandas added its own default index. Hence, to make them our indices, we will use a parameter called __`index_col`__, which specifies the column to use as the index. 

Let us see this in action.

In [7]:
topics=pd.read_excel('https://github.com/yashj1301/Python3-UpGrad-UMich/blob/master/Python%203.x/Upgrad/Modules/Module%203%20-%20Python%20for%20Data%20Science/Session%203%20-%20Pandas/Data/Topics.xlsx?raw=true',
                     index_col=0)
topics[:5]

S.No.,Name of Topic,Subject,Book Referred
1,Probability Theory,Data Analysis and Interpretations,Probability and Statistics for Engineers and S...
2,Random Variables,Data Analysis and Interpretations,Probability and Statistics for Engineers and S...
3,Discrete and Continuous Probability Distributions,Data Analysis and Interpretations,Probability and Statistics for Engineers and S...
4,Normal Distribution,Data Analysis and Interpretations,Probability and Statistics for Engineers and S...
5,Descriptive Statistics and Sampling,Data Analysis and Interpretations,Probability and Statistics for Engineers and S...


In [6]:
schedule=pd.read_csv('https://github.com/yashj1301/Python3-UpGrad-UMich/raw/master/Python%203.x/Upgrad/Modules/Module%203%20-%20Python%20for%20Data%20Science/Session%203%20-%20Pandas/Data/schedule.csv',
                     index_col=0)
schedule

Index,Name of the Course,Name of the Certification,Total Modules,Modules Completed,Modules remaining,Due date,Status
1,Interactivity with Javascript,Web Design for Everybody,4,1,3,11-09-2022,0%
2,Data Science Methodology,IBM Data Science,3,1,2,11-09-2022,0%
3,Ask Questions to Make Data Driven Decisions,Google Data Analytics,4,2,2,11-09-2022,50%
4,Python Data Structures,Python Programming for Everybody,7,0,7,11-09-2022,0%
5,UpGrad Data Science Prep Content,Upgrad DS+AI/ML,6,3,3,31-07-2022,50%
6,UpGrad Data Science Course,Upgrad DS+AI/ML,11,8,3,11-09-2022,73%


See? Our leftmost column became our index by using the keyword argument `index_col`. `read_csv()` also takes 2 other commonly used keyword arguments - `sep` and `header` (`read_excel()` accepts `header` as well). 

The `sep` argument defines the delimiter between the columns of the data, and the `header` argument specifies which row holds the column names. By default, the top row is used; if the file has no header row, we pass `header=None`.
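
The effect of `sep`, `header` and `index_col` can be sketched without any external file by reading from an in-memory string. The text in `raw` and the column names passed to `names` are assumptions made up for this illustration.

```python
import io

import pandas as pd

# Hypothetical file contents: semicolon-separated, with no header row.
raw = "1;Algebra;4\n2;Geometry;6\n"

# header=None tells Pandas that the first line is data, not column names,
# so the columns are labelled 0, 1, 2.
no_header = pd.read_csv(io.StringIO(raw), sep=";", header=None)

# `names` supplies column names (assumed here), and index_col picks
# one of them as the row index.
named = pd.read_csv(
    io.StringIO(raw),
    sep=";",
    header=None,
    names=["S.No.", "Topic", "Modules"],
    index_col="S.No.",
)

print(no_header)
print(named)
```

Passing a column name to `index_col` works the same way as passing its position, which is what the cells above do with `index_col=0`.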

Now that we have imported our data, it is time to understand how to access it. 

### Accessing Data inside a DataFrame 

An important concept in Pandas dataframes is the row and column indices. By default, each row is assigned indices starting from `0`, which are represented to the left of the dataframe. 

The first row in the file (csv, text, etc.) is taken as the header for the columns. If a header is not provided (`header=None`), then the columns are labelled like the row indices, with integers starting from `0`.

To access columns of this dataframe, __we don't use indexing similar to 2-D arrays__. Instead, we use indexing similar to a Python dictionary; the only difference is that __the column names act as the keys__. It follows this syntax - 

<b><code>identifier[[column<sub>1</sub>,column<sub>2</sub>,...,column<sub>x</sub>]]</code></b>

Let us see an example.

In [8]:
schedule

Index,Name of the Course,Name of the Certification,Total Modules,Modules Completed,Modules remaining,Due date,Status
1,Interactivity with Javascript,Web Design for Everybody,4,1,3,11-09-2022,0%
2,Data Science Methodology,IBM Data Science,3,1,2,11-09-2022,0%
3,Ask Questions to Make Data Driven Decisions,Google Data Analytics,4,2,2,11-09-2022,50%
4,Python Data Structures,Python Programming for Everybody,7,0,7,11-09-2022,0%
5,UpGrad Data Science Prep Content,Upgrad DS+AI/ML,6,3,3,31-07-2022,50%
6,UpGrad Data Science Course,Upgrad DS+AI/ML,11,8,3,11-09-2022,73%


In [9]:
schedule[['Name of the Course','Total Modules']]

Index,Name of the Course,Total Modules
1,Interactivity with Javascript,4
2,Data Science Methodology,3
3,Ask Questions to Make Data Driven Decisions,4
4,Python Data Structures,7
5,UpGrad Data Science Prep Content,6
6,UpGrad Data Science Course,11


See? Here, we can display as many columns as we want. Now, if we want to access individual rows or entries inside the dataframe, we cannot use normal positional indexing with square brackets; we have to use the indexers that Pandas provides. 

In [10]:
schedule[['Name of the Course','Total Modules']][0]

KeyError: ignored

See? It is showing us a KeyError, because the `[0]` is looked up as a column label, and there is no column named `0`. 
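
The behaviour can be reproduced on a small frame. This is a sketch with hypothetical data (`df`, `course`, `modules`), not the `schedule` table itself.

```python
import pandas as pd

# Hypothetical stand-in for the schedule table.
df = pd.DataFrame({"course": ["A", "B"], "modules": [4, 3]}, index=[1, 2])

raised = False
try:
    # Plain square brackets look for a COLUMN labelled 0, which does not exist.
    df[["course", "modules"]][0]
except KeyError as err:
    raised = True
    print("KeyError:", err)

# Position-based access goes through the iloc indexer instead.
first_row = df[["course", "modules"]].iloc[0]
print(first_row)
```

`first_row` is a Pandas Series whose `name` is the row label (`1` here), not the position.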

#### Indexing in Pandas DataFrame

Now, to use indexing in Pandas, we use two indexers for __indexing and slicing__. They are accessed with square brackets rather than called like methods -
- `iloc[x,y]` - Here, we will __follow the indexing syntax for 2-D arrays__. This will return the rows and columns as per their integer positions, and not their labels. 
- `loc[[x],[y]]` - Here, we will __follow the indexing syntax for dictionaries__, and our indices will be the row and column labels.

Let us see an example.

##### Location Based Indexing

Here, we will use the normal positional indexing convention, regardless of what the index labels are. This follows the syntax of 2-D arrays. We will use the `iloc[x,y]` indexer here. Each of `x` and `y` can be a single integer or a list of integers. Let us see some examples.

In [None]:
topics

S.No.,Name of Topic,Subject,Book Referred
1,Probability Theory,Data Analysis and Interpretations,Probability and Statistics for Engineers and S...
2,Random Variables,Data Analysis and Interpretations,Probability and Statistics for Engineers and S...
3,Discrete and Continuous Probability Distributions,Data Analysis and Interpretations,Probability and Statistics for Engineers and S...
4,Normal Distribution,Data Analysis and Interpretations,Probability and Statistics for Engineers and S...
5,Descriptive Statistics and Sampling,Data Analysis and Interpretations,Probability and Statistics for Engineers and S...
...,...,...,...
338,Capacitive Sensing and Resonant Drive Circuits,Microsystems - Analysis and Design,Electromechanics and MEMS by T Jones and N Nen...
339,Distributed 1-D and 2-D Capacitive Electromech...,Microsystems - Analysis and Design,Electromechanics and MEMS by T Jones and N Nen...
340,Practical MEMS Devices,Microsystems - Analysis and Design,Electromechanics and MEMS by T Jones and N Nen...
341,Electromechanics of Piezoelectric Elements,Microsystems - Analysis and Design,Electromechanics and MEMS by T Jones and N Nen...


In [None]:
topics.iloc[-3] # this returned the 3rd last row in the table

Name of Topic                               Practical MEMS Devices
Subject                         Microsystems - Analysis and Design
Book Referred    Electromechanics and MEMS by T Jones and N Nen...
Name: 340, dtype: object

In [11]:
topics.iloc[[5,-3]] # this returned the 6th row and the 3rd last row in the table. 
               #The row labels (6 and 340) appear in the index column of the output.

S.No.,Name of Topic,Subject,Book Referred
6,Discrete Data Analysis,Data Analysis and Interpretations,Probability and Statistics for Engineers and S...
340,Practical MEMS Devices,Microsystems - Analysis and Design,Electromechanics and MEMS by T Jones and N Nen...


In [None]:
topics.iloc[3,0] # this returned the 1st column of the 4th row (position 3)

'Normal Distribution'

In [12]:
topics.iloc[[3,10,-7],[0,2]] # this returned the 4th, 11th and 7th last rows and the 1st and 3rd columns

S.No.,Name of Topic,Book Referred
4,Normal Distribution,Probability and Statistics for Engineers and S...
11,Quality Control Method,Probability and Statistics for Engineers and S...
336,Capacitive Lumped Parameter Electromechanics,Electromechanics and MEMS by T Jones and N Nen...


##### Label Based Indexing

Here, we will use the label-based indexing convention, and hence, we will use syntax similar to dictionary indexing. For this, we will use the `loc[[x],[y]]` indexer. Here also, we can pass either single labels or lists of labels. Let us see some examples.

In [16]:
topics.loc[[4,2]] #this returned the rows having row labels 4 and 2 with all columns

S.No.,Name of Topic,Subject,Book Referred
4,Normal Distribution,Data Analysis and Interpretations,Probability and Statistics for Engineers and S...
2,Random Variables,Data Analysis and Interpretations,Probability and Statistics for Engineers and S...


In [18]:
topics.loc[[4],['Subject']]

S.No.,Subject
4,Data Analysis and Interpretations


In [19]:
topics.loc[[4,10,17],['Subject','Book Referred']]

S.No.,Subject,Book Referred
4,Data Analysis and Interpretations,Probability and Statistics for Engineers and S...
10,Data Analysis and Interpretations,Probability and Statistics for Engineers and S...
17,Introduction to Aerospace Engg,Introduction to Flight 8th Ed. By JD Anderson


#### Slicing in DataFrames

Similarly, we can slice DataFrames using the `iloc[]` indexer of Pandas. Slicing is a way of extracting rows and columns when either the names are too long, or the number of rows and columns is so large that you cannot type each one manually. As with Python lists, the stop position of an `iloc` slice is excluded.

In [None]:
topics.iloc[9:-9,-1] # returning last column from 10th row to 10th last row

S.No.
10     Probability and Statistics for Engineers and S...
11     Probability and Statistics for Engineers and S...
12     Probability and Statistics for Engineers and S...
13         Introduction to Flight 8th Ed. By JD Anderson
14         Introduction to Flight 8th Ed. By JD Anderson
                             ...                        
329    Functional Analysis, Calculus of Variations an...
330    Functional Analysis, Calculus of Variations an...
331    Functional Analysis, Calculus of Variations an...
332    Functional Analysis, Calculus of Variations an...
333    Functional Analysis, Calculus of Variations an...
Name: Book Referred, Length: 324, dtype: object

In [None]:
topics.iloc[3,:] #returning 4th row (position 3), all columns

Name of Topic                                  Normal Distribution
Subject                          Data Analysis and Interpretations
Book Referred    Probability and Statistics for Engineers and S...
Name: 4, dtype: object
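
Slices also work with `loc`, but with one difference worth sketching: a `loc` slice includes its stop label, while an `iloc` slice excludes its stop position. The frame below is hypothetical and uses labels starting at 1, similar to the `S.No.` index of `topics`.

```python
import pandas as pd

# Hypothetical frame whose labels start at 1, like the S.No. index.
df = pd.DataFrame(
    {"topic": ["t1", "t2", "t3", "t4", "t5"]},
    index=[1, 2, 3, 4, 5],
)

by_position = df.iloc[1:3]  # positions 1 and 2; stop position 3 is excluded
by_label = df.loc[1:3]      # labels 1, 2 and 3; the stop label is included

print(by_position)
print(by_label)
```

Because the labels are offset from the positions by one, the two slices select different rows even though the numbers inside the brackets are the same.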

If our entry is a string, then we can apply __a second level of indexing__ (ordinary Python string slicing) to the result, to extract parts of the string. Let us see an example. 

In [None]:
topics.iloc[3,2][:4] #extracting the first 4 characters from the 4th row of the third column

'Prob'

In [None]:
topics.iloc[0,1][3:-4] #extracting the characters from the 4th position (index 3) 
                       #up to, but not including, the 4th last character, from the 1st row 2nd column

'a Analysis and Interpretat'

This is how we can do indexing and slicing in Pandas DataFrame. Now, let us see if we can do some __filter indexing__ in dataframes or not.

In [None]:
topics[['Electro' in i for i in topics.iloc[:,0]]] #returning all entries that contain Electro in their Name of Topics

S.No.,Name of Topic,Subject,Book Referred
334,Introduction to Electromechanical Systems,Microsystems - Analysis and Design,Electromechanics and MEMS by T Jones and N Nen...
336,Capacitive Lumped Parameter Electromechanics,Microsystems - Analysis and Design,Electromechanics and MEMS by T Jones and N Nen...
337,Small-Signal Capacitive Electromechanical Systems,Microsystems - Analysis and Design,Electromechanics and MEMS by T Jones and N Nen...
339,Distributed 1-D and 2-D Capacitive Electromech...,Microsystems - Analysis and Design,Electromechanics and MEMS by T Jones and N Nen...
341,Electromechanics of Piezoelectric Elements,Microsystems - Analysis and Design,Electromechanics and MEMS by T Jones and N Nen...
342,Electromechanics of Magnetic MEMS Devices,Microsystems - Analysis and Design,Electromechanics and MEMS by T Jones and N Nen...


In [20]:
topics[['Anderson' in i for i in topics['Book Referred']]].iloc[:5]

S.No.,Name of Topic,Subject,Book Referred
13,The First Aeronautical Engineers,Introduction to Aerospace Engg,Introduction to Flight 8th Ed. By JD Anderson
14,Fundamental Thoughts,Introduction to Aerospace Engg,Introduction to Flight 8th Ed. By JD Anderson
15,The Standard Atmosphere,Introduction to Aerospace Engg,Introduction to Flight 8th Ed. By JD Anderson
16,Basic Aerodynamics,Introduction to Aerospace Engg,Introduction to Flight 8th Ed. By JD Anderson
17,Aerodynamic Shapes,Introduction to Aerospace Engg,Introduction to Flight 8th Ed. By JD Anderson
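
The list comprehensions above build a boolean mask by hand. As a hedged alternative, the same kind of mask can be sketched with the `str.contains()` accessor method, which also copes with missing values; a plain `in` test would raise a `TypeError` on them. The frame `books` and its contents are hypothetical.

```python
import pandas as pd

# Hypothetical data with one missing title.
books = pd.DataFrame(
    {"title": ["Electromechanics", "Flight", None, "Electrostatics"]},
    index=[1, 2, 3, 4],
)

# regex=False treats the pattern as a plain substring;
# na=False makes missing entries count as "no match".
mask = books["title"].str.contains("Electro", regex=False, na=False)
matches = books[mask]

print(matches)
```

Either approach yields a sequence of `True`/`False` values with one entry per row, which is all that the square brackets need for filtering.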


This is how we use filter indexing in DataFrames. Now, before we move to operations, we will learn more about some common methods used in DataFrames.

### Methods in DataFrames

DataFrames expose a large number of attributes and methods. The ones listed here are those we will use most often; several of them carry over directly from Pandas Series and NumPy arrays. The first group consists of attributes, so they are accessed without parentheses -

- __`index`__ - This attribute returns the row labels of the dataframe as a Pandas `Index` object. The `Index` has its own `name` attribute, which holds the name of the index. 
- __`columns`__ - This attribute returns a Pandas `Index` containing the column labels of the dataframe.
- __`shape`__ - This attribute returns a tuple with the number of rows and columns of the dataframe. 
- __`size`__ - This attribute returns the total number of elements (rows*columns) of the dataframe.
- __`ndim`__ - This attribute returns the number of dimensions of the dataframe, which is 2.

- __`head()`,`tail()`__ - These methods will return the first and last `n` rows of the dataframe (`n` defaults to 5). 
- __`info()`__ - This method will print a summary of the dataframe, such as the index type, the data type of each column, the number of non-null values and the memory usage.
- __`describe()`__ - This method returns the descriptive statistics for the dataframe. For numeric columns, these are the count, mean, standard deviation, minimum, quartiles (including the median) and maximum; for non-numeric columns, they are the count, number of unique values, most frequent value and its frequency.
- __`copy()`__ - This method will return a copy of the dataframe. The copy is deep by default, so changes to the copy's data do not affect the original. 
- __`drop()`__ - This method will delete columns or rows from the dataframe. It takes a label or a list of labels and two keyword arguments - `axis`, which defines the axis of the element to be deleted (0 for row and 1 for column), and an optional `inplace`, which, when `True`, modifies the original dataframe instead of returning a new one. 
- __`count()`__ - This method will count the number of non-null values in the dataframe. It will return a Pandas Series containing the non-null count for each column.
- __`sum(),prod(),mean()`__ - These methods will aggregate the values along an axis and return one result per column (`axis=0`, the default) or per row (`axis=1`). 
- __`sort_values()`,`sort_index()`__ - These methods will sort the dataframe on the basis of values or index.
- __`between()`__ - This is a Series method, so it is applied to a single column. It returns a boolean Pandas Series that is `True` where the element lies between two given values (both bounds included by default). It is typically used on numerical columns. 
- __`unique()`,`nunique()`,`value_counts()`__ - These methods will carry out the uniqueness utilities in Pandas. These are described already in Pandas Series; note that `unique()` is a Series method and is applied to a single column. 
- __`max()`,`min()`,`idxmax()`,`idxmin()`__ - These methods will return the maximum and minimum values and the index labels at which they occur. They are typically applied to numerical columns.
- __`isin()`__ - This method will return a boolean object of the same shape as the input (a Series for a single column) that shows whether each element is among the value(s) passed inside the parentheses. 

Let us see them in action.

In [None]:
topics.index,topics.index.name

(Int64Index([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,
             ...
             333, 334, 335, 336, 337, 338, 339, 340, 341, 342],
            dtype='int64', name='S.No.', length=342),
 'S.No.')

In [None]:
topics.columns

Index(['Name of Topic', 'Subject', 'Book Referred'], dtype='object')

In [None]:
topics.shape

(342, 3)

In [None]:
topics.size,topics.ndim

(1026, 2)

In [None]:
topics.head(5)

S.No.,Name of Topic,Subject,Book Referred
1,Probability Theory,Data Analysis and Interpretations,Probability and Statistics for Engineers and S...
2,Random Variables,Data Analysis and Interpretations,Probability and Statistics for Engineers and S...
3,Discrete and Continuous Probability Distributions,Data Analysis and Interpretations,Probability and Statistics for Engineers and S...
4,Normal Distribution,Data Analysis and Interpretations,Probability and Statistics for Engineers and S...
5,Descriptive Statistics and Sampling,Data Analysis and Interpretations,Probability and Statistics for Engineers and S...


In [None]:
topics.tail(4)

S.No.,Name of Topic,Subject,Book Referred
339,Distributed 1-D and 2-D Capacitive Electromech...,Microsystems - Analysis and Design,Electromechanics and MEMS by T Jones and N Nen...
340,Practical MEMS Devices,Microsystems - Analysis and Design,Electromechanics and MEMS by T Jones and N Nen...
341,Electromechanics of Piezoelectric Elements,Microsystems - Analysis and Design,Electromechanics and MEMS by T Jones and N Nen...
342,Electromechanics of Magnetic MEMS Devices,Microsystems - Analysis and Design,Electromechanics and MEMS by T Jones and N Nen...


In [None]:
topics.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 342 entries, 1 to 342
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Name of Topic  342 non-null    object
 1   Subject        342 non-null    object
 2   Book Referred  342 non-null    object
dtypes: object(3)
memory usage: 18.8+ KB


In [None]:
topics.describe()

,Name of Topic,Subject,Book Referred
count,342,342,342
unique,342,24,24
top,Probability Theory,Introduction to Composite Structures,Aerospace Materials and Material Technologies ...
freq,1,26,26


In [None]:
topics_copy=topics.copy()
topics_copy.index.name='SN'
topics_copy.head()

SN,Name of Topic,Subject,Book Referred
1,Probability Theory,Data Analysis and Interpretations,Probability and Statistics for Engineers and S...
2,Random Variables,Data Analysis and Interpretations,Probability and Statistics for Engineers and S...
3,Discrete and Continuous Probability Distributions,Data Analysis and Interpretations,Probability and Statistics for Engineers and S...
4,Normal Distribution,Data Analysis and Interpretations,Probability and Statistics for Engineers and S...
5,Descriptive Statistics and Sampling,Data Analysis and Interpretations,Probability and Statistics for Engineers and S...


In [None]:
topics.head()

S.No.,Name of Topic,Subject,Book Referred
1,Probability Theory,Data Analysis and Interpretations,Probability and Statistics for Engineers and S...
2,Random Variables,Data Analysis and Interpretations,Probability and Statistics for Engineers and S...
3,Discrete and Continuous Probability Distributions,Data Analysis and Interpretations,Probability and Statistics for Engineers and S...
4,Normal Distribution,Data Analysis and Interpretations,Probability and Statistics for Engineers and S...
5,Descriptive Statistics and Sampling,Data Analysis and Interpretations,Probability and Statistics for Engineers and S...
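
The two cells above show that changing `topics_copy` left `topics` untouched. A small sketch, using a hypothetical frame named `original`, contrasts this with plain assignment, which only creates a second name for the same object.

```python
import pandas as pd

original = pd.DataFrame({"value": [1, 2, 3]})

alias = original               # a second name for the SAME dataframe
independent = original.copy()  # a separate dataframe with its own data

alias.loc[0, "value"] = 100       # also visible through `original`
independent.loc[1, "value"] = -1  # not visible through `original`

print(original)
print(independent)
```

After running this, `original` holds `100` in its first row but still holds `2` in its second row, since only the independent copy received the `-1`.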


In [None]:
topics_copy.drop(['Subject'],axis=1,inplace=True)
topics_copy.head()

SN,Name of Topic,Book Referred
1,Probability Theory,Probability and Statistics for Engineers and S...
2,Random Variables,Probability and Statistics for Engineers and S...
3,Discrete and Continuous Probability Distributions,Probability and Statistics for Engineers and S...
4,Normal Distribution,Probability and Statistics for Engineers and S...
5,Descriptive Statistics and Sampling,Probability and Statistics for Engineers and S...


In [None]:
topics.count()

Name of Topic    342
Subject          342
Book Referred    342
dtype: int64

In [None]:
topics.nunique()

Name of Topic    342
Subject           24
Book Referred     24
dtype: int64

In [None]:
topics['Subject'].value_counts().sort_index()

Aerodynamics                            20
Aerospace Propulsion                    21
Aerospace Structural Mechanics           6
Aircraft Design                         24
Aircraft Modelling and Simulation        9
Aircraft Propulsion                     18
Compressible Fluid Mechanics            12
Computational Fluid Dynamics (CFD)      10
Continuum Mechanics                      9
Control Theory                          13
Data Analysis and Interpretations       12
Finite Element Method (FEM)             19
Flight Mechanics                         8
Incompressible Fluid Mechanics          19
Introduction to Aerospace Engg          10
Introduction to Composite Structures    26
Microsystems - Analysis and Design       9
Navigation and Guidance                  9
Optimal Control Systems                 22
Solid Mechanics                         10
Spaceflight Mechanics                   12
Thermodynamics and Propulsion           11
Turbo Machines                           9
Vibrations 

In [None]:
topics['Book Referred'].value_counts().max(),topics['Book Referred'].value_counts().min()

(26, 6)

In [41]:
topics[topics['Subject'].isin(['Data Analysis and Interpretations','Control Theory'])]['Subject'].value_counts()

Control Theory                       13
Data Analysis and Interpretations    12
Name: Subject, dtype: int64
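
The `topics` table has no numerical columns, so the aggregation, sorting and range methods from the list are not exercised above. The sketch below tries them on a small hypothetical frame (`scores`, with made-up names and marks).

```python
import pandas as pd

# Hypothetical numeric data, used only for illustration.
scores = pd.DataFrame(
    {"maths": [70, 85, 90], "physics": [65, 95, 80]},
    index=["Asha", "Bo", "Chen"],
)

column_totals = scores.sum()                 # one total per column
row_means = scores.mean(axis=1)              # one mean per row
ranked = scores.sort_values("maths", ascending=False)
in_range = scores["physics"].between(70, 90)  # bounds included by default
best_in_maths = scores["maths"].idxmax()      # index label of the maximum
without_physics = scores.drop(columns=["physics"])  # returns a new frame

print(column_totals, row_means, ranked, in_range, best_in_maths, sep="\n\n")
```

Here `drop()` is called without `inplace=True`, so `scores` keeps both columns and the result is stored under a new name.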

All these methods will come in handy. Next, we will learn about the Operations in Pandas DataFrame. 

### Operations in Pandas DataFrame

After you have loaded the data into dataframes, it will not necessarily be usable in its original format. You may have to modify existing entries or generate new ones from the existing data to get the desired format. Let’s take a look at the features that the Pandas library offers in this respect.

First, let us import a dataset `Sales` on which we will be performing our operations.


In [89]:
sales=pd.read_excel('https://github.com/yashj1301/Python3-UpGrad-UMich/blob/master/Python%203.x/Upgrad/Modules/Module%203%20-%20Python%20for%20Data%20Science/Session%203%20-%20Pandas/Data/sales.xlsx?raw=true',
                  index_col=1)
sales.head(5)

Region,Market,No_of_Orders,Profit,Sales
Western Africa,Africa,251,-12901.51,78476.06
Southern Africa,Africa,85,11768.58,51319.5
North Africa,Africa,182,21643.08,86698.89
Eastern Africa,Africa,110,8013.04,44182.6
Central Africa,Africa,103,15606.3,61689.99


Here, we have our index as the Region. Let us see some information regarding our dataset.

In [44]:
sales.info()

<class 'pandas.core.frame.DataFrame'>
Index: 23 entries, Western Africa to Canada
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Market        23 non-null     object 
 1   No_of_Orders  23 non-null     int64  
 2   Profit        23 non-null     float64
 3   Sales         23 non-null     float64
dtypes: float64(2), int64(1), object(1)
memory usage: 1.4+ KB


In [45]:
sales.describe()

,No_of_Orders,Profit,Sales
count,23.0,23.0,23.0
mean,366.478261,28859.944783,206285.108696
std,246.590361,27701.193773,160589.886606
min,37.0,-16766.9,8190.74
25%,211.5,12073.085,82587.475
50%,356.0,20948.84,170416.31
75%,479.5,45882.845,290182.375
max,964.0,82091.27,656637.14


We can perform some basic operations on our dataframes. First, we will look for some basic Arithmetic operations. 

#### Arithmetic Operations

The arithmetic operators `+`,`-`,`*`,`/`,`//`,`**`,`%` work only where the underlying data type supports them. On string columns, `+` concatenates strings and `*` with an integer repeats them; all the other operators require numerical columns. Let us see an example.

In [46]:
sales.info()

<class 'pandas.core.frame.DataFrame'>
Index: 23 entries, Western Africa to Canada
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Market        23 non-null     object 
 1   No_of_Orders  23 non-null     int64  
 2   Profit        23 non-null     float64
 3   Sales         23 non-null     float64
dtypes: float64(2), int64(1), object(1)
memory usage: 1.4+ KB


Here, we have 3 columns that have numerical data - `No_of_Orders`, `Profit` and `Sales`. Now, let us perform some basic arithmetic operations on them. 

Note that indexing a dataframe with a single column label yields a Pandas Series, while indexing it with a list of column labels yields a Pandas DataFrame. Let us see an example now.

In [49]:
type(sales['Sales']),type(sales[['Sales','Profit']])

(pandas.core.series.Series, pandas.core.frame.DataFrame)

First, we will perform some arithmetic operations on our Sales Column. This will return a Pandas Series. 

In [55]:
sales['Sales']/100

Region
Western Africa        784.7606
Southern Africa       513.1950
North Africa          866.9889
Eastern Africa        441.8260
Central Africa        616.8999
Western Asia         1243.1224
Southern Asia        3518.0660
Southeastern Asia    3297.5138
Oceania              4080.0298
Eastern Asia         3153.9077
Central Asia           81.9074
Western Europe       6566.3714
Southern Europe      2157.0393
Northern Europe      2529.6909
Eastern Europe       1082.5893
South America        2107.1049
Central America      4616.7028
Caribbean            1163.3305
Western US           2519.9183
Southern US          1487.7191
Eastern US           2649.7398
Central US           1704.1631
Canada                262.9881
Name: Sales, dtype: float64

See? We can simply perform all the arithmetic operations on such columns, because when we select a column from a DataFrame, it becomes a Pandas Series. 

Now, let us see what would happen to a DataFrame. 

In [56]:
sales[['Profit','Sales']]/100

Region,Profit,Sales
Western Africa,-129.0151,784.7606
Southern Africa,117.6858,513.195
North Africa,216.4308,866.9889
Eastern Africa,80.1304,441.826
Central Africa,156.063,616.8999
Western Asia,-167.669,1243.1224
Southern Asia,679.9876,3518.066
Southeastern Asia,209.4884,3297.5138
Oceania,547.3402,4080.0298
Eastern Asia,728.051,3153.9077


In [57]:
sales[['Market','Profit']]/100

TypeError: ignored

So, here it is clearly visible that __we can perform basic arithmetic operations on the DataFrames__ in the same manner as Pandas Series, but there is a condition - <font color="red">All the columns selected should have a data type that supports the operation</font>. 
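
One way to satisfy that condition without listing the columns by hand is to select them by data type first. The sketch below assumes a small hypothetical frame `sales_like` shaped like the `sales` table and uses `select_dtypes()` to keep only the numerical columns before dividing.

```python
import pandas as pd

# Hypothetical frame with one text column and two numerical columns.
sales_like = pd.DataFrame(
    {
        "Market": ["Africa", "Europe"],
        "Profit": [1200.0, -300.0],
        "Sales": [5000.0, 2500.0],
    },
    index=["Region A", "Region B"],
)

# Keep only the columns with a numerical dtype, then apply the operation.
numeric_part = sales_like.select_dtypes(include="number")
scaled = numeric_part / 100

print(scaled)
```

The `Market` column is left out of `scaled`, so the division never touches string data.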

#### Logical Operations

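In the case of Logical Operations, we can use the element-wise operators `&` (and), `|` (or) and `~` (not) alongside our comparison operators `>=`,`<=`,`==`,`!=`. Each comparison must be wrapped in parentheses when they are combined, because `&` and `|` bind more tightly than the comparison operators. These operations also come with limitations. 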

A string column cannot be compared with a number using the ordering operators `>=`,`<=` (comparing it with another string works and is lexicographic), and similarly, string operations such as slicing cannot be applied to the numeric data type columns. 
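
As a minimal sketch of how such conditions combine, the block below builds two boolean masks on a hypothetical frame `df` and joins them with the element-wise operators `&`, `|` and `~`. The region names and values are made up.

```python
import pandas as pd

# Hypothetical data shaped like the sales table.
df = pd.DataFrame(
    {"Market": ["Africa", "Europe", "Africa"], "Profit": [-50.0, 400.0, 120.0]},
    index=["R1", "R2", "R3"],
)

profitable = df["Profit"] > 0
african = df["Market"] == "Africa"

both = df[african & profitable]    # rows satisfying both conditions
either = df[african | profitable]  # rows satisfying at least one condition
not_african = df[~african]         # rows where the condition is False

print(both, either, not_african, sep="\n\n")
```

When the comparisons are written inline instead of being stored in variables, each one needs its own parentheses, for example `(df["Profit"] > 0) & (df["Market"] == "Africa")`.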

Let us see some examples.

In [58]:
sales[['Profit']].head()

Region,Profit
Western Africa,-12901.51
Southern Africa,11768.58
North Africa,21643.08
Eastern Africa,8013.04
Central Africa,15606.3


In [61]:
posneg=pd.Series(['Negative' if i<0 else 'Positive' for i in sales['Profit']],index=sales.index)
posneg

Region
Western Africa       Negative
Southern Africa      Positive
North Africa         Positive
Eastern Africa       Positive
Central Africa       Positive
Western Asia         Negative
Southern Asia        Positive
Southeastern Asia    Positive
Oceania              Positive
Eastern Asia         Positive
Central Asia         Negative
Western Europe       Positive
Southern Europe      Positive
Northern Europe      Positive
Eastern Europe       Positive
South America        Positive
Central America      Positive
Caribbean            Positive
Western US           Positive
Southern US          Positive
Eastern US           Positive
Central US           Positive
Canada               Positive
dtype: object

In [65]:
(sales['Market']=='Canada') | (sales['Profit']>35000)

Region
Western Africa       False
Southern Africa      False
North Africa         False
Eastern Africa       False
Central Africa       False
Western Asia         False
Southern Asia         True
Southeastern Asia    False
Oceania               True
Eastern Asia          True
Central Asia         False
Western Europe        True
Southern Europe      False
Northern Europe       True
Eastern Europe       False
South America        False
Central America       True
Caribbean            False
Western US            True
Southern US          False
Eastern US            True
Central US           False
Canada               False
dtype: bool

In [68]:
sales['Market']!='Asia Pacific'

Region
Western Africa        True
Southern Africa       True
North Africa          True
Eastern Africa        True
Central Africa        True
Western Asia         False
Southern Asia        False
Southeastern Asia    False
Oceania              False
Eastern Asia         False
Central Asia         False
Western Europe        True
Southern Europe       True
Northern Europe       True
Eastern Europe        True
South America         True
Central America       True
Caribbean             True
Western US            True
Southern US           True
Eastern US            True
Central US            True
Canada                True
Name: Market, dtype: bool

As we can see, logical operations return boolean Pandas Series. Such a Series can be used as a mask inside the indexing operator to keep only the rows of a DataFrame where it is `True`. Now, let us look at some ways to modify a DataFrame.
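To make the filtering step concrete before moving on, here is a minimal sketch. The frame `toy`, its labels and its values are made up for illustration and are not the `sales` data.

```python
import pandas as pd

# Hypothetical toy data; labels and numbers are assumptions
toy = pd.DataFrame(
    {'Market': ['Africa', 'Asia Pacific', 'Europe', 'USCA'],
     'Profit': [-12901.51, 67998.76, 21570.39, 2629.88]},
    index=pd.Index(['Western Africa', 'Southern Asia',
                    'Southern Europe', 'Canada'], name='Region'),
)

# Each comparison is parenthesised because & and | bind tighter than != and >
mask = (toy['Market'] != 'Asia Pacific') & (toy['Profit'] > 0)

# Passing the boolean Series to [] keeps only the rows where it is True
filtered = toy[mask]

# ~ inverts a boolean Series (element-wise "not")
not_profitable = toy[~(toy['Profit'] > 0)]

print(filtered)
print(not_profitable)
```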

### Modifying a DataFrame

We can perform several modifications in our dataframe. The most basic modifications include adding, removing, or modifying an existing column or row of the dataframe. 

Let us look at some methods we can use for this - 

- __`rename()`__ - This method is used to rename the rows and columns of a DataFrame. It takes two keyword arguments, `index` (for rows) and `columns` (for columns), each of which takes a dictionary with the current label as key and the new label as value. The optional argument `inplace`, if `True`, modifies the current DataFrame instead of returning a new one.
- __`drop()`__ - This method is used to delete columns or rows from a DataFrame. It takes a list of the column or row labels to be deleted, an `axis` argument (`0` for rows, `1` for columns), and a boolean argument `inplace` which, if `True`, modifies the DataFrame itself instead of returning a new one. 
- __`apply()`__ - This method applies a function to every value of a column (or to every row or column of a DataFrame). It is commonly used with a `lambda` function to create a new column, or modify an existing column, based on some condition. 

Let us see them in action.

##### Renaming Rows and Columns

In [69]:
sales.columns

Index(['Market', 'No_of_Orders', 'Profit', 'Sales'], dtype='object')

In [74]:
sales.rename(columns={'No_of_Orders':'Orders'},inplace=True)

In [75]:
sales.head()

Region,Market,Orders,Profit,Sales
Western Africa,Africa,251,-12901.51,78476.06
Southern Africa,Africa,85,11768.58,51319.5
North Africa,Africa,182,21643.08,86698.89
Eastern Africa,Africa,110,8013.04,44182.6
Central Africa,Africa,103,15606.3,61689.99


In [76]:
sales.index

Index(['Western Africa', 'Southern Africa', 'North Africa', 'Eastern Africa',
       'Central Africa', 'Western Asia', 'Southern Asia', 'Southeastern Asia',
       'Oceania', 'Eastern Asia', 'Central Asia', 'Western Europe',
       'Southern Europe', 'Northern Europe', 'Eastern Europe', 'South America',
       'Central America', 'Caribbean', 'Western US', 'Southern US',
       'Eastern US', 'Central US', 'Canada'],
      dtype='object', name='Region')

In [77]:
sales.rename(index={'Western Africa':'Africa W','Central US':'USAC'},inplace=True)
sales.index

Index(['Africa W', 'Southern Africa', 'North Africa', 'Eastern Africa',
       'Central Africa', 'Western Asia', 'Southern Asia', 'Southeastern Asia',
       'Oceania', 'Eastern Asia', 'Central Asia', 'Western Europe',
       'Southern Europe', 'Northern Europe', 'Eastern Europe', 'South America',
       'Central America', 'Caribbean', 'Western US', 'Southern US',
       'Eastern US', 'USAC', 'Canada'],
      dtype='object', name='Region')
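The calls above change `sales` itself because of `inplace=True`. As a hedged sketch of the alternative, the following uses a made-up frame (`toy` and its values are assumptions) to show that `rename()` otherwise returns a new DataFrame, and that it also accepts a function in place of a dictionary:

```python
import pandas as pd

# Hypothetical toy frame; labels are made up for illustration
toy = pd.DataFrame(
    {'No_of_Orders': [251, 85], 'Profit': [-12901.51, 11768.58]},
    index=pd.Index(['Western Africa', 'Southern Africa'], name='Region'),
)

# Without inplace=True, rename() returns a new DataFrame and leaves toy unchanged
renamed = toy.rename(columns={'No_of_Orders': 'Orders'},
                     index={'Western Africa': 'Africa W'})

# A function can be passed instead of a dictionary; it is applied to every label
lowered = toy.rename(columns=str.lower)

print(renamed)
print(toy.columns.tolist())      # original labels are still in place
print(lowered.columns.tolist())
```

Returning a new object makes it easier to keep the original labels around while experimenting.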

##### Deleting Columns and Rows 

Before deleting a column / row in the DataFrame, let us make a copy of it using the `copy()` method. Then, we will delete from the copied DataFrame. 

In [84]:
new_sales=sales.copy()
new_sales.head()

Region,Market,Orders,Profit,Sales
Africa W,Africa,251,-12901.51,78476.06
Southern Africa,Africa,85,11768.58,51319.5
North Africa,Africa,182,21643.08,86698.89
Eastern Africa,Africa,110,8013.04,44182.6
Central Africa,Africa,103,15606.3,61689.99


In [85]:
new_sales.drop(['Africa W','Central Africa'],axis=0,inplace=True)
new_sales.index

Index(['Southern Africa', 'North Africa', 'Eastern Africa', 'Western Asia',
       'Southern Asia', 'Southeastern Asia', 'Oceania', 'Eastern Asia',
       'Central Asia', 'Western Europe', 'Southern Europe', 'Northern Europe',
       'Eastern Europe', 'South America', 'Central America', 'Caribbean',
       'Western US', 'Southern US', 'Eastern US', 'USAC', 'Canada'],
      dtype='object', name='Region')

In [86]:
new_sales.drop(['Market','Profit'],axis=1,inplace=True)
new_sales.columns

Index(['Orders', 'Sales'], dtype='object')
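As an alternative to the `axis` argument, `drop()` also accepts `index` and `columns` keywords. The sketch below uses a made-up frame (`toy` and its values are assumptions) and leaves out `inplace`, so the original frame is untouched:

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration
toy = pd.DataFrame(
    {'Market': ['Africa', 'Africa', 'Europe'],
     'Orders': [251, 85, 964],
     'Sales': [78476.06, 51319.5, 656637.14]},
    index=pd.Index(['Africa W', 'Southern Africa', 'Western Europe'],
                   name='Region'),
)

# index= and columns= name the axis directly, so no axis argument is needed
trimmed = toy.drop(index=['Africa W'], columns=['Market'])

# errors='ignore' skips labels that are not present instead of raising KeyError
safe = toy.drop(columns=['Profit'], errors='ignore')

print(trimmed)
print(safe.shape)
```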

##### Modifying an Existing Column

Now, let us modify an existing column. We will divide each entry in the `Profit` column by 1000, round it to 2 decimal places using the `round()` function, convert it to a string with the `astype()` method, and append the suffix ' k'. Let us see.

In [91]:
modify_sales=sales.copy()
modify_sales.head()

Region,Market,No_of_Orders,Profit,Sales
Western Africa,Africa,251,-12901.51,78476.06
Southern Africa,Africa,85,11768.58,51319.5
North Africa,Africa,182,21643.08,86698.89
Eastern Africa,Africa,110,8013.04,44182.6
Central Africa,Africa,103,15606.3,61689.99


In [98]:
modify_sales['Profit']=round(modify_sales['Profit']/1000,2).astype(str)+' k'
modify_sales.head()

Region,Market,No_of_Orders,Profit,Sales
Western Africa,Africa,251,-12.9 k,78476.06
Southern Africa,Africa,85,11.77 k,51319.5
North Africa,Africa,182,21.64 k,86698.89
Eastern Africa,Africa,110,8.01 k,44182.6
Central Africa,Africa,103,15.61 k,61689.99
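Note that converting a rounded number with `astype(str)` drops trailing zeros, which is why the first row reads `-12.9 k` rather than `-12.90 k`. If a fixed number of decimals is wanted, one possible approach is a format string applied with `map()`. This is a sketch on a made-up Series, not part of the original notebook:

```python
import pandas as pd

# Hypothetical values copied from the first rows shown above
profit = pd.Series([-12901.51, 11768.58, 21643.08], name='Profit')

# .round(2) is the Series method equivalent of round(series, 2)
as_text = (profit / 1000).round(2).astype(str) + ' k'

# A format string always keeps two decimal places
fixed = (profit / 1000).map('{:.2f} k'.format)

print(as_text.tolist())
print(fixed.tolist())
```

Either way the column becomes text, so arithmetic on it is no longer possible afterwards.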


##### Adding a New Column to the DataFrame

Now, let us learn how to add a new column to the DataFrame. For this, we use the `apply()` method on an existing column. Here, we will use a `lambda` function, which is applied to every value of the `Profit` column. Let us see an example. 

In [108]:
sales['Profit / Loss'] = sales['Profit'].apply(lambda x: 'Profit (+)' if x>0 else 'Loss (-)')
sales.head()

Region,Market,No_of_Orders,Profit,Sales,Profit / Loss
Western Africa,Africa,251,-12901.51,78476.06,Loss (-)
Southern Africa,Africa,85,11768.58,51319.5,Profit (+)
North Africa,Africa,182,21643.08,86698.89,Profit (+)
Eastern Africa,Africa,110,8013.04,44182.6,Profit (+)
Central Africa,Africa,103,15606.3,61689.99,Profit (+)
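For a simple two-way condition like this one, a vectorised alternative to `apply()` is NumPy's `np.where()`. The following sketch, on a made-up frame (`toy` is an assumption), shows that both approaches give the same labels:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the zero is included to show the boundary case
toy = pd.DataFrame({'Profit': [-12901.51, 11768.58, 0.0]})

# np.where(condition, value_if_true, value_if_false) works on the whole column
toy['Profit / Loss'] = np.where(toy['Profit'] > 0, 'Profit (+)', 'Loss (-)')

# The same labels computed value by value with apply() and a lambda
via_apply = toy['Profit'].apply(lambda x: 'Profit (+)' if x > 0 else 'Loss (-)')

print(toy)
print((toy['Profit / Loss'] == via_apply).all())
```

With either version a profit of exactly zero is labelled as a loss, because the condition is `x > 0`.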


If we want a __derived column__ that does <font color="red">not use any conditional operators</font>, then we can simply perform our arithmetic operation on the existing columns and assign the result to a new column. The following example illustrates this.

In [112]:
sales['Revenue']=round((sales['Profit']*sales['No_of_Orders'])/100000,3).astype(str)+'L'
sales.head()

Region,Market,No_of_Orders,Profit,Sales,Profit / Loss,Revenue
Western Africa,Africa,251,-12901.51,78476.06,Loss (-),-32.383L
Southern Africa,Africa,85,11768.58,51319.5,Profit (+),10.003L
North Africa,Africa,182,21643.08,86698.89,Profit (+),39.39L
Eastern Africa,Africa,110,8013.04,44182.6,Profit (+),8.814L
Central Africa,Africa,103,15606.3,61689.99,Profit (+),16.074L
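Appending a suffix turns the derived column into text. When the result is needed for further calculations, it can be kept numeric instead. A minimal sketch, assuming a made-up frame `toy` and a hypothetical column name `Margin %` that does not appear in the original notebook:

```python
import pandas as pd

# Hypothetical toy frame using the first two rows shown above
toy = pd.DataFrame(
    {'Profit': [-12901.51, 11768.58], 'Sales': [78476.06, 51319.5]},
    index=pd.Index(['Western Africa', 'Southern Africa'], name='Region'),
)

# Column arithmetic is element-wise; the result stays numeric
toy['Margin %'] = (toy['Profit'] / toy['Sales'] * 100).round(2)

# Numeric columns can still be sorted and aggregated
print(toy.sort_values('Margin %'))
```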


This was all about the basics of Pandas DataFrames. We can clearly see how efficient and easy to use they are. Of course, there is a plethora of features to learn, but they are not difficult - just vast. 