---   

<h1 align="center">Introduction to Data Analyst and Data Science for beginners</h1>
<h1 align="center">Lecture no 2.16(Pandas-07)</h1>

---
<h3><div align="right">Ehtisham Sadiq</div></h3>    

<img align="right" width="400" height="400"  src="images/pandas-apps.png"  >

## _Modifying Dataframes Part-II_

In [None]:
# To install this library in Jupyter notebook
#import sys
#!{sys.executable} -m pip install pandas

In [1]:
import pandas as pd
pd.__version__ , pd.__path__

('1.4.2', ['/home/dell/.local/lib/python3.8/site-packages/pandas'])

<h4 align="center">A Dataframe is a two-dimensional labeled data structure with heterogeneously typed columns, having both row and column indices.</h4>

<img align="right" width="500" height="500"  src="images/pandas00.png"  >

## Learning agenda of this notebook
- **Recap:**
    - Modifying Column names of Dataframe
    - Modifying Row indices of Dataframe
    - Modifying Data inside a Dataframe (Row-wise, Column-wise, Element-wise)


1. Add a New Column in a Dataframe
2. Delete an Existing Column from a Dataframe
3. Add a New Row in  a Dataframe
4. Delete an Existing Row(s) from a Dataframe
5. Adding a New Column with Conditional Values
6. Deleting Row(s) Based on Specific Condition
7. Delete a Column  Based on Specific Condition
8. Change Datatype of a Pandas Series
9. Sorting dataframes using `df.sort_values()`
10. Sorting dataframes using `df.sort_index()`

##  Read a Sample Dataframe

In [25]:
import pandas as pd
df = pd.read_csv('datasets/groupdata.csv')
df.head()

Unnamed: 0,roll no,name,age,address,session,group,gender,subj1,subj2,scholarship
0,MS01,Rauf,52,Lahore,MORNING,group C,Male,78.3,84.4,5000.0
1,MS02,Arif,51,Islamabad,AFT,group A,Male,70.5,60.5,6000.0
2,MS03,Shaista,35,Karachi,AFTERNOON,group B,Female,64.9,75.1,8500.0
3,MS04,Hadeed,20,Lahore,MOR,group A,Male,82.0,84.3,4000.0
4,MS05,Zara,40,Peshawer,AFT,group D,Female,65.9,72.8,3500.0


In [11]:
# The `shape` attribute of a dataframe returns a two-value tuple containing the number of rows and columns
# Note that the row count does not include the column labels and the column count does not include the row index
df.shape

(16, 10)

In [12]:
# The `index` attribute of a dataframe returns the row index (its labels and type)
df.index

RangeIndex(start=0, stop=16, step=1)

In [13]:
# The `columns` attribute of a dataframe returns the column labels and their datatype
df.columns

Index(['roll no', 'name', 'age', 'address', 'session', 'group', 'gender',
       'subj1', 'subj2', 'scholarship'],
      dtype='object')

In [14]:
# The `dtypes` attribute of a dataframe returns the data type of each column in the dataframe
df.dtypes

roll no         object
name            object
age              int64
address         object
session         object
group           object
gender          object
subj1          float64
subj2          float64
scholarship    float64
dtype: object

## 1. Add a New Column in a Dataframe
- To add a new column to a dataframe, create an appropriate series and assign it to the dataframe.
- Every time a new series is added to a dataframe, its name also becomes accessible as an attribute of that dataframe (provided the name is a valid Python identifier).
- The series can be created from scratch, which can be cumbersome if the dataframe has thousands of rows.
- Another common way to add a column is to construct a series from the existing data within the dataframe.
- Let us understand this with an example.

In [26]:
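# Add a new column from a pandas Series (NumPy is needed for the random marks)
import numpy as np
# Only 13 values are generated for 16 rows, so the last three rows of `sub3` get NaN
df['sub3'] = pd.Series(np.random.randint(50, 95, 13))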
df.head()

Unnamed: 0,roll no,name,age,address,session,group,gender,subj1,subj2,scholarship,sub3
0,MS01,Rauf,52,Lahore,MORNING,group C,Male,78.3,84.4,5000.0,77.0
1,MS02,Arif,51,Islamabad,AFT,group A,Male,70.5,60.5,6000.0,72.0
2,MS03,Shaista,35,Karachi,AFTERNOON,group B,Female,64.9,75.1,8500.0,71.0
3,MS04,Hadeed,20,Lahore,MOR,group A,Male,82.0,84.3,4000.0,89.0
4,MS05,Zara,40,Peshawer,AFT,group D,Female,65.9,72.8,3500.0,81.0
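
The float values in `sub3` above are a side effect of index alignment. The following is a minimal sketch of that behaviour on a hypothetical four-row frame (`demo`, `short` and `full` are illustrative names, not part of the lecture's dataset): a shorter Series leaves NaN in the unmatched rows, while a NumPy array must match the row count exactly.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame standing in for the lecture's dataset
demo = pd.DataFrame({'name': ['A', 'B', 'C', 'D'],
                     'subj1': [70.0, 80.0, 90.0, 60.0]})

# A Series with only three values: assignment aligns on the index,
# so the unmatched last row receives NaN and the column becomes float64
demo['short'] = pd.Series([55, 65, 75])

# A plain NumPy array carries no index, so its length must equal the row count
rng = np.random.default_rng(0)
demo['full'] = rng.integers(50, 95, size=len(demo))

print(demo)
print(demo.dtypes)
```

Generating `len(demo)` values, as in the second assignment, is one way to avoid unintended missing marks.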


In [34]:
# Create a new column that flags the students who are eligible for a scholarship (average above 70%)
df['Eligibility'] = ((df[['subj1','subj2','sub3']].sum(axis=1)/300)*100) > 70
df

Unnamed: 0,roll no,name,age,address,session,group,gender,subj1,subj2,scholarship,sub3,Eligibility
0,MS01,Rauf,52,Lahore,MORNING,group C,Male,78.3,84.4,5000.0,77.0,True
1,MS02,Arif,51,Islamabad,AFT,group A,Male,70.5,60.5,6000.0,72.0,False
2,MS03,Shaista,35,Karachi,AFTERNOON,group B,Female,64.9,75.1,8500.0,71.0,True
3,MS04,Hadeed,20,Lahore,MOR,group A,Male,82.0,84.3,4000.0,89.0,True
4,MS05,Zara,40,Peshawer,AFT,group D,Female,65.9,72.8,3500.0,81.0,True
5,MS06,Mohid,16,Lahore,MORNING,group C,Female,69.3,78.6,,61.0,False
6,MS07,Zobia,40,Sialkot,AFT,group B,Female,90.2,,4000.0,83.0,False
7,MS08,Idrees,51,Multan,MORNING,group D,Male,84.1,76.0,8000.0,78.0,True
8,MS09,Jamil,53,Karachi,AFT,group C,Male,90.5,81.3,3500.0,92.0,True
9,MS10,Shahid,38,Lahore,AFTERNOON,group D,Male,90.5,81.3,3800.0,77.0,True


In [44]:
df.Eligibility=df.Eligibility.replace({True:'Eligible',False:'Not_Eligible'})
# df.head()
df.Eligibility.value_counts()

Eligible        9
Not_Eligible    7
Name: Eligibility, dtype: int64

In [45]:
df.columns

Index(['roll no', 'name', 'age', 'address', 'session', 'group', 'gender',
       'subj1', 'subj2', 'scholarship', 'sub3', 'Eligibility'],
      dtype='object')

In [46]:
df.subj1 + df.subj2

0     162.7
1     131.0
2     140.0
3     166.3
4     138.7
5     147.9
6       NaN
7     160.1
8     171.8
9     171.8
10    171.8
11    171.8
12      NaN
13    160.1
14    171.8
15    171.8
dtype: float64

In [47]:
df.subj1.add(df.subj2)

0     162.7
1     131.0
2     140.0
3     166.3
4     138.7
5     147.9
6       NaN
7     160.1
8     171.8
9     171.8
10    171.8
11    171.8
12      NaN
13    160.1
14    171.8
15    171.8
dtype: float64

In [48]:
ser1 = df.subj1.add(df.subj2, fill_value=0)
ser1

0     162.7
1     131.0
2     140.0
3     166.3
4     138.7
5     147.9
6      90.2
7     160.1
8     171.8
9     171.8
10    171.8
11    171.8
12     76.5
13    160.1
14    171.8
15    171.8
dtype: float64
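
The three cells above differ only in how missing marks are treated. As a small sketch on two made-up Series (`s1` and `s2` are assumptions for illustration), `+` and `.add()` propagate NaN, while `fill_value=0` substitutes 0 for a value missing on one side only:

```python
import numpy as np
import pandas as pd

s1 = pd.Series([90.2, np.nan, np.nan])
s2 = pd.Series([np.nan, 76.5, np.nan])

plain = s1 + s2                    # NaN wherever either side is missing
filled = s1.add(s2, fill_value=0)  # a single missing side is treated as 0

print(plain.tolist())
print(filled.tolist())
```

Note that the last element stays NaN even with `fill_value=0`, because the value is missing on both sides.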

In [50]:
# On the left side of assignment you must use `[]` operator, while on the right you can use dot operator as well
df['total'] = ser1
df

Unnamed: 0,roll no,name,age,address,session,group,gender,subj1,subj2,scholarship,sub3,Eligibility,total
0,MS01,Rauf,52,Lahore,MORNING,group C,Male,78.3,84.4,5000.0,77.0,Eligible,162.7
1,MS02,Arif,51,Islamabad,AFT,group A,Male,70.5,60.5,6000.0,72.0,Not_Eligible,131.0
2,MS03,Shaista,35,Karachi,AFTERNOON,group B,Female,64.9,75.1,8500.0,71.0,Eligible,140.0
3,MS04,Hadeed,20,Lahore,MOR,group A,Male,82.0,84.3,4000.0,89.0,Eligible,166.3
4,MS05,Zara,40,Peshawer,AFT,group D,Female,65.9,72.8,3500.0,81.0,Eligible,138.7
5,MS06,Mohid,16,Lahore,MORNING,group C,Female,69.3,78.6,,61.0,Not_Eligible,147.9
6,MS07,Zobia,40,Sialkot,AFT,group B,Female,90.2,,4000.0,83.0,Not_Eligible,90.2
7,MS08,Idrees,51,Multan,MORNING,group D,Male,84.1,76.0,8000.0,78.0,Eligible,160.1
8,MS09,Jamil,53,Karachi,AFT,group C,Male,90.5,81.3,3500.0,92.0,Eligible,171.8
9,MS10,Shahid,38,Lahore,AFTERNOON,group D,Male,90.5,81.3,3800.0,77.0,Eligible,171.8


Note that when a Jupyter notebook cell shows no output after execution, it does not mean that nothing happened; the processing has been done silently. Here, a new column named `total` has been added to the dataframe `df`. Let us confirm this:

In [None]:
df.head(3)

## 2. Delete an Existing Column from a Dataframe
- You can use any of the following ways to delete a column from a dataframe:
    - Use `del df['colname']`, which removes the column but does not return it
    - Use the `df.pop('colname')` method, which removes the column and returns it as a series
    - Use the `df.drop()` method, which is more flexible than the above two: it can delete more than one column, is not in place by default, and can delete rows as well
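
The three options can be compared side by side. This is only a sketch on a hypothetical frame with throwaway column names (`a` to `d`), not the lecture's dataset:

```python
import pandas as pd

demo = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6], 'd': [7, 8]})

del demo['a']                        # in place, returns nothing
popped = demo.pop('b')               # in place, returns the removed Series
smaller = demo.drop(columns=['c'])   # returns a new frame, demo is untouched

print(list(demo.columns))     # ['c', 'd']
print(popped.tolist())        # [3, 4]
print(list(smaller.columns))  # ['d']
```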

### a. Option 1: Using `del df['colname']`
- The `del df['colname']` statement removes the column without returning it. It operates in place.

In [51]:
df.columns

Index(['roll no', 'name', 'age', 'address', 'session', 'group', 'gender',
       'subj1', 'subj2', 'scholarship', 'sub3', 'Eligibility', 'total'],
      dtype='object')

In [52]:
del df['total']

In [53]:
df.head(3)

Unnamed: 0,roll no,name,age,address,session,group,gender,subj1,subj2,scholarship,sub3,Eligibility
0,MS01,Rauf,52,Lahore,MORNING,group C,Male,78.3,84.4,5000.0,77.0,Eligible
1,MS02,Arif,51,Islamabad,AFT,group A,Male,70.5,60.5,6000.0,72.0,Not_Eligible
2,MS03,Shaista,35,Karachi,AFTERNOON,group B,Female,64.9,75.1,8500.0,71.0,Eligible


### b. Option 2: Using `df.pop('colname')`
- The `df.pop('colname')` method removes the column and returns the Series that has been removed from the dataframe. It operates in place.

In [54]:
df.pop('address')

0        Lahore
1     Islamabad
2       Karachi
3        Lahore
4      Peshawer
5        Lahore
6       Sialkot
7        Multan
8       Karachi
9        Lahore
10    Islamabad
11      Karachi
12       Lahore
13       Multan
14      Sialkot
15       Multan
Name: address, dtype: object

In [55]:
df.head(3)

Unnamed: 0,roll no,name,age,session,group,gender,subj1,subj2,scholarship,sub3,Eligibility
0,MS01,Rauf,52,MORNING,group C,Male,78.3,84.4,5000.0,77.0,Eligible
1,MS02,Arif,51,AFT,group A,Male,70.5,60.5,6000.0,72.0,Not_Eligible
2,MS03,Shaista,35,AFTERNOON,group B,Female,64.9,75.1,8500.0,71.0,Eligible


### c. Option 3: Using `df.drop()`
- The `df.drop()` method removes one or more columns and returns a new dataframe (or `None` when `inplace=True`).

```
df.drop(columns=[---],  axis=1, inplace=False)
```
- To drop more than one column, pass the column names as a Python list to the `columns` parameter. The `axis` argument specifies the direction of operation (left to right when deleting columns); it is needed only when the names are passed through the `labels` parameter, since `columns=[...]` already implies `axis=1`.
- By default the operation is not in place. Most pandas methods that return a dataframe have an `inplace` parameter whose default value is `False`, meaning the original dataframe is left unchanged and a modified copy is returned.
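
To make the two points above concrete, here is a short sketch on a hypothetical three-column frame (the names `demo`, `by_keyword`, `by_axis` and `result` are illustrative). It shows that `columns=[...]` and `labels` with `axis=1` give the same result, and that `inplace=True` returns `None`:

```python
import pandas as pd

demo = pd.DataFrame({'name': ['A', 'B'], 'age': [20, 30], 'city': ['X', 'Y']})

# Two equivalent spellings; neither modifies demo
by_keyword = demo.drop(columns=['name', 'age'])
by_axis = demo.drop(['name', 'age'], axis=1)
print(by_keyword.equals(by_axis))

# With inplace=True the frame itself changes and the call returns None
result = demo.drop(columns=['city'], inplace=True)
print(result)
print(list(demo.columns))
```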

In [56]:
import pandas as pd
df = pd.read_csv('datasets/groupdata.csv')
df.head(3)

Unnamed: 0,roll no,name,age,address,session,group,gender,subj1,subj2,scholarship
0,MS01,Rauf,52,Lahore,MORNING,group C,Male,78.3,84.4,5000.0
1,MS02,Arif,51,Islamabad,AFT,group A,Male,70.5,60.5,6000.0
2,MS03,Shaista,35,Karachi,AFTERNOON,group B,Female,64.9,75.1,8500.0


In [57]:
df.drop(columns='name')

Unnamed: 0,roll no,age,address,session,group,gender,subj1,subj2,scholarship
0,MS01,52,Lahore,MORNING,group C,Male,78.3,84.4,5000.0
1,MS02,51,Islamabad,AFT,group A,Male,70.5,60.5,6000.0
2,MS03,35,Karachi,AFTERNOON,group B,Female,64.9,75.1,8500.0
3,MS04,20,Lahore,MOR,group A,Male,82.0,84.3,4000.0
4,MS05,40,Peshawer,AFT,group D,Female,65.9,72.8,3500.0
5,MS06,16,Lahore,MORNING,group C,Female,69.3,78.6,
6,MS07,40,Sialkot,AFT,group B,Female,90.2,,4000.0
7,MS08,51,Multan,MORNING,group D,Male,84.1,76.0,8000.0
8,MS09,53,Karachi,AFT,group C,Male,90.5,81.3,3500.0
9,MS10,38,Lahore,AFTERNOON,group D,Male,90.5,81.3,3800.0


In [58]:
# Remember axis is the direction of operation, and axis=1 is the column axis that goes from left to right
df.drop(columns=['name', 'age', 'address'], axis=1)

Unnamed: 0,roll no,session,group,gender,subj1,subj2,scholarship
0,MS01,MORNING,group C,Male,78.3,84.4,5000.0
1,MS02,AFT,group A,Male,70.5,60.5,6000.0
2,MS03,AFTERNOON,group B,Female,64.9,75.1,8500.0
3,MS04,MOR,group A,Male,82.0,84.3,4000.0
4,MS05,AFT,group D,Female,65.9,72.8,3500.0
5,MS06,MORNING,group C,Female,69.3,78.6,
6,MS07,AFT,group B,Female,90.2,,4000.0
7,MS08,MORNING,group D,Male,84.1,76.0,8000.0
8,MS09,AFT,group C,Male,90.5,81.3,3500.0
9,MS10,AFTERNOON,group D,Male,90.5,81.3,3800.0


It has just returned the resulting dataframe after removing the columns. No change has been made to the original dataframe.

In [59]:
df.head(3)

Unnamed: 0,roll no,name,age,address,session,group,gender,subj1,subj2,scholarship
0,MS01,Rauf,52,Lahore,MORNING,group C,Male,78.3,84.4,5000.0
1,MS02,Arif,51,Islamabad,AFT,group A,Male,70.5,60.5,6000.0
2,MS03,Shaista,35,Karachi,AFTERNOON,group B,Female,64.9,75.1,8500.0


Let us repeat the operation with `inplace=True`. Note that this time it returns `None`; however, the changes have been made to the original dataframe.

In [60]:
df.drop(columns=['age', 'address', 'name'], axis=1, inplace=True)

In [61]:
df.head(3)

Unnamed: 0,roll no,session,group,gender,subj1,subj2,scholarship
0,MS01,MORNING,group C,Male,78.3,84.4,5000.0
1,MS02,AFT,group A,Male,70.5,60.5,6000.0
2,MS03,AFTERNOON,group B,Female,64.9,75.1,8500.0


## 3. Add a New Row in  a Dataframe
- To add a new row to a dataframe, create an appropriate dataframe and then use the `df.append()` method, which returns a new dataframe with the row added.
```
df.append(other, ignore_index=False)
```

**More on append in next session**
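
Since `df.append()` is deprecated as of pandas 1.4 and no longer exists in pandas 2.0, the same row addition can be sketched with `pd.concat()`. The frame below is a hypothetical two-row stand-in for the dataset; `demo`, `new_row` and `combined` are illustrative names.

```python
import pandas as pd

demo = pd.DataFrame({'roll no': ['MS01', 'MS02'],
                     'name': ['Rauf', 'Arif'],
                     'scholarship': [5000.0, 6000.0]})

# New row with a different column order and no scholarship value
new_row = pd.DataFrame({'name': ['New Student'], 'roll no': ['MS222']})

# Columns are matched by name; ignore_index=True renumbers the rows 0..n-1
combined = pd.concat([demo, new_row], ignore_index=True)
print(combined)
```

The missing `scholarship` value of the new row shows up as NaN, and the column order of the first frame is kept.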

In [62]:
import numpy as np
import pandas as pd
df = pd.read_csv('datasets/groupdata.csv')
df.tail()

Unnamed: 0,roll no,name,age,address,session,group,gender,subj1,subj2,scholarship
11,MS12,Maaz,25,Karachi,AFTERNOON,group C,Male,90.5,81.3,
12,MS13,Mujahid,18,Lahore,MORNING,group D,Male,,76.5,7000.0
13,MS14,Sara,28,Multan,AFTERNOON,group A,Female,84.1,76.0,8000.0
14,MS15,Fatima,33,Sialkot,AFT,group C,Female,90.5,81.3,3500.0
15,MS16,Kakamanna,42,Multan,AFTERNOON,group A,Male,90.5,81.3,3800.0


In [63]:
# Let us create a new dataframe having a single row
newdf = pd.DataFrame(data=[['MS222', 100, 'Kamokey', 'AFT', 'group D', 'Male', 55.0, 55.0, 9999]],
     columns=['roll no', 'age', 'address', 'session', 'group', 'gender','subj1', 'subj2', 'scholarship'])
newdf

Unnamed: 0,roll no,age,address,session,group,gender,subj1,subj2,scholarship
0,MS222,100,Kamokey,AFT,group D,Male,55.0,55.0,9999


In [74]:
# Let us create a new dataframe having a single row (Can always create one having multiple rows). 
# Do note that we have not mentioned the scholarship data value as well as the scholarship column name

newdf = pd.DataFrame(data=[['New Student', 'MS222', 100, 'Kamokey', 'AFT', 'group D', 'Male', 55.0, 55.0]],
     columns=['name', 'roll no', 'age', 'address', 'session', 'group', 'gender','subj1', 'subj2'])


newdf

Unnamed: 0,name,roll no,age,address,session,group,gender,subj1,subj2
0,New Student,MS222,100,Kamokey,AFT,group D,Male,55.0,55.0


Note: The index associated with the only row in the above dataframe is 0. Moreover, the column order is not the same as in `df` (`name` comes before `roll no`).

In [75]:
df = df.append(newdf, ignore_index=True )
df.tail()



Unnamed: 0,roll no,name,age,address,session,group,gender,subj1,subj2,scholarship
15,MS16,Kakamanna,42,Multan,AFTERNOON,group A,Male,90.5,81.3,3800.0
16,MS222,New Student,100,Kamokey,AFT,group D,Male,55.0,55.0,
17,MS222,New Student,100,Kamokey,AFT,group D,Male,55.0,55.0,
18,MS222,New Student,100,Kamokey,AFT,group D,Male,55.0,55.0,
19,MS222,New Student,100,Kamokey,AFT,group D,Male,55.0,55.0,


- Note that, due to the `ignore_index=True` argument, the new row has been assigned the next available index. Otherwise, the new row would keep its row index 0.
- The new row appears four times (indices 16 to 19) in the output above, which suggests the cell was executed more than once on the same `df`.
- Moreover, note the NaN value under the scholarship column for the newly added row.
- One last thing: the `df.append()` method does not have an `inplace` argument, so you always have to assign the resulting dataframe back to `df`.
- `df.append()` is deprecated as of pandas 1.4 and was removed in pandas 2.0; `pd.concat()` is the recommended replacement.
- As an exercise, find out why `df.drop()` has an `inplace` argument while **`df.append()` does not.**

In [76]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   roll no      20 non-null     object 
 1   name         20 non-null     object 
 2   age          20 non-null     object 
 3   address      20 non-null     object 
 4   session      20 non-null     object 
 5   group        20 non-null     object 
 6   gender       20 non-null     object 
 7   subj1        19 non-null     float64
 8   subj2        19 non-null     float64
 9   scholarship  14 non-null     float64
dtypes: float64(3), object(7)
memory usage: 1.7+ KB


## 4. Delete an Existing Row(s) from a Dataframe
- The `df.drop()` method can also remove one or more rows (not just columns); it returns a new dataframe, or `None` when `inplace=True`.

```
df.drop(index=[---],  axis=0, inplace=False)
```
- To drop more than one row, pass the row indices as a Python list to the `index` parameter. The `axis` argument specifies the direction of operation (top to bottom when deleting rows); it is needed only when the indices are passed through the `labels` parameter, since `index=[...]` already implies `axis=0`.
- By default the operation is not in place. Most pandas methods that return a dataframe have an `inplace` parameter whose default value is `False`, meaning the original dataframe is left unchanged and a modified copy is returned.
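
Dropping rows leaves gaps in the row labels. A minimal sketch, assuming a hypothetical five-row frame named `demo`, of dropping by label and then renumbering with `reset_index(drop=True)`:

```python
import pandas as pd

demo = pd.DataFrame({'name': ['A', 'B', 'C', 'D', 'E']})

# Remove the rows labelled 2 and 4; demo itself is not modified
trimmed = demo.drop(index=[2, 4])
print(list(trimmed.index))      # [0, 1, 3]

# Renumber the remaining rows from 0 and discard the old labels
renumbered = trimmed.reset_index(drop=True)
print(list(renumbered.index))   # [0, 1, 2]
```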

In [77]:
import pandas as pd
df = pd.read_csv('datasets/groupdata.csv')
df.head()

Unnamed: 0,roll no,name,age,address,session,group,gender,subj1,subj2,scholarship
0,MS01,Rauf,52,Lahore,MORNING,group C,Male,78.3,84.4,5000.0
1,MS02,Arif,51,Islamabad,AFT,group A,Male,70.5,60.5,6000.0
2,MS03,Shaista,35,Karachi,AFTERNOON,group B,Female,64.9,75.1,8500.0
3,MS04,Hadeed,20,Lahore,MOR,group A,Male,82.0,84.3,4000.0
4,MS05,Zara,40,Peshawer,AFT,group D,Female,65.9,72.8,3500.0


In [79]:
df.loc[[2,4]]

Unnamed: 0,roll no,name,age,address,session,group,gender,subj1,subj2,scholarship
2,MS03,Shaista,35,Karachi,AFTERNOON,group B,Female,64.9,75.1,8500.0
4,MS05,Zara,40,Peshawer,AFT,group D,Female,65.9,72.8,3500.0


In [80]:
df.drop(index=[2,4], axis=0, inplace = True)

In [81]:
df.head()

Unnamed: 0,roll no,name,age,address,session,group,gender,subj1,subj2,scholarship
0,MS01,Rauf,52,Lahore,MORNING,group C,Male,78.3,84.4,5000.0
1,MS02,Arif,51,Islamabad,AFT,group A,Male,70.5,60.5,6000.0
3,MS04,Hadeed,20,Lahore,MOR,group A,Male,82.0,84.3,4000.0
5,MS06,Mohid,16,Lahore,MORNING,group C,Female,69.3,78.6,
6,MS07,Zobia,40,Sialkot,AFT,group B,Female,90.2,,4000.0


## 5. Adding a New Column with Conditional Values

**Create a Simple Dataframe**

In [82]:
import pandas as pd
df = pd.read_csv('datasets/groupdata.csv')
df.head()

Unnamed: 0,roll no,name,age,address,session,group,gender,subj1,subj2,scholarship
0,MS01,Rauf,52,Lahore,MORNING,group C,Male,78.3,84.4,5000.0
1,MS02,Arif,51,Islamabad,AFT,group A,Male,70.5,60.5,6000.0
2,MS03,Shaista,35,Karachi,AFTERNOON,group B,Female,64.9,75.1,8500.0
3,MS04,Hadeed,20,Lahore,MOR,group A,Male,82.0,84.3,4000.0
4,MS05,Zara,40,Peshawer,AFT,group D,Female,65.9,72.8,3500.0


**Example:** Add a new column `total` that contains the sum of marks in `subj1` and `subj2`. Then add a new column that contains the string `"Good"` if `total > 150` and `"Bad"` otherwise.

In [84]:
df['total']= df.subj1.add(df.subj2, fill_value=0)
df

Unnamed: 0,roll no,name,age,address,session,group,gender,subj1,subj2,scholarship,total
0,MS01,Rauf,52,Lahore,MORNING,group C,Male,78.3,84.4,5000.0,162.7
1,MS02,Arif,51,Islamabad,AFT,group A,Male,70.5,60.5,6000.0,131.0
2,MS03,Shaista,35,Karachi,AFTERNOON,group B,Female,64.9,75.1,8500.0,140.0
3,MS04,Hadeed,20,Lahore,MOR,group A,Male,82.0,84.3,4000.0,166.3
4,MS05,Zara,40,Peshawer,AFT,group D,Female,65.9,72.8,3500.0,138.7
5,MS06,Mohid,16,Lahore,MORNING,group C,Female,69.3,78.6,,147.9
6,MS07,Zobia,40,Sialkot,AFT,group B,Female,90.2,,4000.0,90.2
7,MS08,Idrees,51,Multan,MORNING,group D,Male,84.1,76.0,8000.0,160.1
8,MS09,Jamil,53,Karachi,AFT,group C,Male,90.5,81.3,3500.0,171.8
9,MS10,Shahid,38,Lahore,AFTERNOON,group D,Male,90.5,81.3,3800.0,171.8


In [87]:
def grade_check(x):
    if x>150:
        return 'Good'
    else:
        return 'Bad'
df['Grade'] = df.total.apply(grade_check)
df

Unnamed: 0,roll no,name,age,address,session,group,gender,subj1,subj2,scholarship,total,Grade
0,MS01,Rauf,52,Lahore,MORNING,group C,Male,78.3,84.4,5000.0,162.7,Good
1,MS02,Arif,51,Islamabad,AFT,group A,Male,70.5,60.5,6000.0,131.0,Bad
2,MS03,Shaista,35,Karachi,AFTERNOON,group B,Female,64.9,75.1,8500.0,140.0,Bad
3,MS04,Hadeed,20,Lahore,MOR,group A,Male,82.0,84.3,4000.0,166.3,Good
4,MS05,Zara,40,Peshawer,AFT,group D,Female,65.9,72.8,3500.0,138.7,Bad
5,MS06,Mohid,16,Lahore,MORNING,group C,Female,69.3,78.6,,147.9,Bad
6,MS07,Zobia,40,Sialkot,AFT,group B,Female,90.2,,4000.0,90.2,Bad
7,MS08,Idrees,51,Multan,MORNING,group D,Male,84.1,76.0,8000.0,160.1,Good
8,MS09,Jamil,53,Karachi,AFT,group C,Male,90.5,81.3,3500.0,171.8,Good
9,MS10,Shahid,38,Lahore,AFTERNOON,group D,Male,90.5,81.3,3800.0,171.8,Good


In [None]:
df['total'] = df.subj1 + df.subj2
df.head()

In [88]:
# Same rule as grade_check (total > 150), written as a list comprehension
list1 = ['Good' if i > 150 else 'Bad' for i in df.total]
list1

['Good',
 'Bad',
 'Bad',
 'Good',
 'Bad',
 'Bad',
 'Bad',
 'Good',
 'Good',
 'Good',
 'Good',
 'Good',
 'Bad',
 'Good',
 'Good',
 'Good']

In [89]:
df['grade'] = ['Good' if i > 150 else 'Bad' for i in df.total]
df.head()

Unnamed: 0,roll no,name,age,address,session,group,gender,subj1,subj2,scholarship,total,Grade,grade
0,MS01,Rauf,52,Lahore,MORNING,group C,Male,78.3,84.4,5000.0,162.7,Good,Good
1,MS02,Arif,51,Islamabad,AFT,group A,Male,70.5,60.5,6000.0,131.0,Bad,Bad
2,MS03,Shaista,35,Karachi,AFTERNOON,group B,Female,64.9,75.1,8500.0,140.0,Bad,Bad
3,MS04,Hadeed,20,Lahore,MOR,group A,Male,82.0,84.3,4000.0,166.3,Good,Good
4,MS05,Zara,40,Peshawer,AFT,group D,Female,65.9,72.8,3500.0,138.7,Bad,Bad
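
The same conditional column can also be built without a Python-level function or loop. The sketch below uses `np.where` with the `total > 150` rule from the example statement; this is an alternative technique, not the one used in the cells above, and the `demo` frame with its totals is made up for illustration.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'total': [162.7, 131.0, 150.0, 90.2]})

# np.where(condition, value_if_true, value_if_false) works on the whole column at once
demo['grade'] = np.where(demo['total'] > 150, 'Good', 'Bad')

print(demo)
```

A total of exactly 150 is labelled `"Bad"` here because the comparison is strict.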


## 6. Deleting Row(s) Based on Specific Condition

In [90]:
df = pd.read_csv('datasets/groupdata.csv')
df.head()

Unnamed: 0,roll no,name,age,address,session,group,gender,subj1,subj2,scholarship
0,MS01,Rauf,52,Lahore,MORNING,group C,Male,78.3,84.4,5000.0
1,MS02,Arif,51,Islamabad,AFT,group A,Male,70.5,60.5,6000.0
2,MS03,Shaista,35,Karachi,AFTERNOON,group B,Female,64.9,75.1,8500.0
3,MS04,Hadeed,20,Lahore,MOR,group A,Male,82.0,84.3,4000.0
4,MS05,Zara,40,Peshawer,AFT,group D,Female,65.9,72.8,3500.0


In [91]:
df['address'] == 'Lahore'

0      True
1     False
2     False
3      True
4     False
5      True
6     False
7     False
8     False
9      True
10    False
11    False
12     True
13    False
14    False
15    False
Name: address, dtype: bool

In [92]:
df[df['address'] == 'Lahore']

Unnamed: 0,roll no,name,age,address,session,group,gender,subj1,subj2,scholarship
0,MS01,Rauf,52,Lahore,MORNING,group C,Male,78.3,84.4,5000.0
3,MS04,Hadeed,20,Lahore,MOR,group A,Male,82.0,84.3,4000.0
5,MS06,Mohid,16,Lahore,MORNING,group C,Female,69.3,78.6,
9,MS10,Shahid,38,Lahore,AFTERNOON,group D,Male,90.5,81.3,3800.0
12,MS13,Mujahid,18,Lahore,MORNING,group D,Male,,76.5,7000.0


In [93]:
df[df['address'] == 'Lahore'].index

Int64Index([0, 3, 5, 9, 12], dtype='int64')

In [94]:
df.drop(index = df[df['address'] == 'Lahore'].index, axis = 0, inplace = True)
df

Unnamed: 0,roll no,name,age,address,session,group,gender,subj1,subj2,scholarship
1,MS02,Arif,51,Islamabad,AFT,group A,Male,70.5,60.5,6000.0
2,MS03,Shaista,35,Karachi,AFTERNOON,group B,Female,64.9,75.1,8500.0
4,MS05,Zara,40,Peshawer,AFT,group D,Female,65.9,72.8,3500.0
6,MS07,Zobia,40,Sialkot,AFT,group B,Female,90.2,,4000.0
7,MS08,Idrees,51,Multan,MORNING,group D,Male,84.1,76.0,8000.0
8,MS09,Jamil,53,Karachi,AFT,group C,Male,90.5,81.3,3500.0
10,MS11,Khurram,35,Islamabad,MOR,group B,Male,90.5,81.3,6000.0
11,MS12,Maaz,25,Karachi,AFTERNOON,group C,Male,90.5,81.3,
13,MS14,Sara,28,Multan,AFTERNOON,group A,Female,84.1,76.0,8000.0
14,MS15,Fatima,33,Sialkot,AFT,group C,Female,90.5,81.3,3500.0


In [95]:
# Let us drop every row in which session is 'AFT'
# Get the indices where session == 'AFT' using the .index attribute
indices = df[df['session'] == 'AFT'].index
indices


Int64Index([1, 4, 6, 8, 14], dtype='int64')

In [96]:
# Pass those indices to the drop method to delete those rows
df.drop(index = indices, inplace = True)

In [97]:
df

Unnamed: 0,roll no,name,age,address,session,group,gender,subj1,subj2,scholarship
2,MS03,Shaista,35,Karachi,AFTERNOON,group B,Female,64.9,75.1,8500.0
7,MS08,Idrees,51,Multan,MORNING,group D,Male,84.1,76.0,8000.0
10,MS11,Khurram,35,Islamabad,MOR,group B,Male,90.5,81.3,6000.0
11,MS12,Maaz,25,Karachi,AFTERNOON,group C,Male,90.5,81.3,
13,MS14,Sara,28,Multan,AFTERNOON,group A,Female,84.1,76.0,8000.0
15,MS16,Kakamanna,42,Multan,AFTERNOON,group A,Male,90.5,81.3,3800.0
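
Instead of collecting indices and passing them to `df.drop()`, the rows to keep can be selected directly by negating the boolean mask. A small sketch on a hypothetical frame (`demo`, `mask`, `kept` and `also_kept` are illustrative names) showing that both routes give the same rows:

```python
import pandas as pd

demo = pd.DataFrame({'name': ['A', 'B', 'C', 'D'],
                     'address': ['Lahore', 'Karachi', 'Lahore', 'Multan']})

mask = demo['address'] == 'Lahore'

# Route 1: keep the rows where the condition is False
kept = demo[~mask]

# Route 2: drop the rows whose labels satisfy the condition
also_kept = demo.drop(index=demo[mask].index)

print(kept)
print(kept.equals(also_kept))
```

Neither route modifies `demo`; assign the result back if the change should persist.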


## 7. Delete a Column  Based on Specific Condition

In [98]:
df = pd.read_csv('datasets/groupdata.csv')
df

Unnamed: 0,roll no,name,age,address,session,group,gender,subj1,subj2,scholarship
0,MS01,Rauf,52,Lahore,MORNING,group C,Male,78.3,84.4,5000.0
1,MS02,Arif,51,Islamabad,AFT,group A,Male,70.5,60.5,6000.0
2,MS03,Shaista,35,Karachi,AFTERNOON,group B,Female,64.9,75.1,8500.0
3,MS04,Hadeed,20,Lahore,MOR,group A,Male,82.0,84.3,4000.0
4,MS05,Zara,40,Peshawer,AFT,group D,Female,65.9,72.8,3500.0
5,MS06,Mohid,16,Lahore,MORNING,group C,Female,69.3,78.6,
6,MS07,Zobia,40,Sialkot,AFT,group B,Female,90.2,,4000.0
7,MS08,Idrees,51,Multan,MORNING,group D,Male,84.1,76.0,8000.0
8,MS09,Jamil,53,Karachi,AFT,group C,Male,90.5,81.3,3500.0
9,MS10,Shahid,38,Lahore,AFTERNOON,group D,Male,90.5,81.3,3800.0


In [99]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   roll no      16 non-null     object 
 1   name         16 non-null     object 
 2   age          16 non-null     int64  
 3   address      16 non-null     object 
 4   session      16 non-null     object 
 5   group        16 non-null     object 
 6   gender       16 non-null     object 
 7   subj1        15 non-null     float64
 8   subj2        15 non-null     float64
 9   scholarship  14 non-null     float64
dtypes: float64(3), int64(1), object(6)
memory usage: 1.4+ KB


**Example:** Let us drop the column(s) from the above dataframe that have 2 or more NaN values

In [108]:
# df.drop(df.loc[:,df.isnull().sum() >=2], axis=1)

In [109]:
# Flag every column that has at least one NaN (note the threshold used here is 1, not 2)
mylist_mask = df.apply(lambda col: col.isnull().sum() >= 1)
mylist_mask

roll no        False
name           False
age            False
address        False
session        False
group          False
gender         False
subj1           True
subj2           True
scholarship     True
dtype: bool

In [110]:
mylist_names=df.columns[mylist_mask]
mylist_names

Index(['subj1', 'subj2', 'scholarship'], dtype='object')

In [111]:
df.drop(columns=mylist_names, axis=1, inplace=True)

In [112]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   roll no  16 non-null     object
 1   name     16 non-null     object
 2   age      16 non-null     int64 
 3   address  16 non-null     object
 4   session  16 non-null     object
 5   group    16 non-null     object
 6   gender   16 non-null     object
dtypes: int64(1), object(6)
memory usage: 1.0+ KB


In [113]:
# With a threshold of 2 this would delete only the scholarship column; it was already dropped above, so nothing changes here
df.drop(columns=df.columns[df.apply(lambda col: col.isnull().sum() >= 2)], axis=1, inplace=True)

In [114]:
# Verify
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   roll no  16 non-null     object
 1   name     16 non-null     object
 2   age      16 non-null     int64 
 3   address  16 non-null     object
 4   session  16 non-null     object
 5   group    16 non-null     object
 6   gender   16 non-null     object
dtypes: int64(1), object(6)
memory usage: 1.0+ KB


In [115]:
df.head(3)

Unnamed: 0,roll no,name,age,address,session,group,gender
0,MS01,Rauf,52,Lahore,MORNING,group C,Male
1,MS02,Arif,51,Islamabad,AFT,group A,Male
2,MS03,Shaista,35,Karachi,AFTERNOON,group B,Female
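
The NaN threshold can also be applied directly to the per-column counts returned by `df.isnull().sum()`. The following sketch uses a hypothetical three-row frame in which only `scholarship` has two missing values; all names are illustrative.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'name': ['A', 'B', 'C'],
                     'subj1': [78.3, np.nan, 64.9],
                     'scholarship': [np.nan, np.nan, 8500.0]})

# Number of missing values in each column
nan_counts = demo.isnull().sum()

# Labels of the columns with 2 or more NaN values
to_drop = nan_counts[nan_counts >= 2].index

cleaned = demo.drop(columns=to_drop)
print(list(cleaned.columns))   # ['name', 'subj1']
```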


## 8. Change Datatype of a Pandas Series
- Use the `astype(dtype)` method to cast a pandas object to the specified `dtype`.
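
As a quick sketch, `astype()` can also be called on a whole dataframe with a dictionary that maps column names to target types. The frame and its `marks` column of numeric strings are assumptions made for this illustration.

```python
import pandas as pd

demo = pd.DataFrame({'age': [52, 51], 'marks': ['78', '84']})

# Cast several columns in one call; a new dataframe is returned
converted = demo.astype({'age': 'float64', 'marks': 'int64'})

print(converted.dtypes)
```

The original `demo` keeps its dtypes, so assign the result back if the change should persist.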

### a. Changing Datatype from `int64` to `float64`

In [116]:
df = pd.read_csv('datasets/groupdata.csv')
df.dtypes

roll no         object
name            object
age              int64
address         object
session         object
group           object
gender          object
subj1          float64
subj2          float64
scholarship    float64
dtype: object

In [117]:
# Suppose we want to change the datatype of the `age` column to float64
df['age'] = df.age.astype(float)
df.dtypes

roll no         object
name            object
age            float64
address         object
session         object
group           object
gender          object
subj1          float64
subj2          float64
scholarship    float64
dtype: object

In [118]:
df.head()

Unnamed: 0,roll no,name,age,address,session,group,gender,subj1,subj2,scholarship
0,MS01,Rauf,52.0,Lahore,MORNING,group C,Male,78.3,84.4,5000.0
1,MS02,Arif,51.0,Islamabad,AFT,group A,Male,70.5,60.5,6000.0
2,MS03,Shaista,35.0,Karachi,AFTERNOON,group B,Female,64.9,75.1,8500.0
3,MS04,Hadeed,20.0,Lahore,MOR,group A,Male,82.0,84.3,4000.0
4,MS05,Zara,40.0,Peshawer,AFT,group D,Female,65.9,72.8,3500.0


### b. Changing Datatype from string to boolean

In [119]:
df = pd.read_csv('datasets/groupdata.csv')
df.head()

Unnamed: 0,roll no,name,age,address,session,group,gender,subj1,subj2,scholarship
0,MS01,Rauf,52,Lahore,MORNING,group C,Male,78.3,84.4,5000.0
1,MS02,Arif,51,Islamabad,AFT,group A,Male,70.5,60.5,6000.0
2,MS03,Shaista,35,Karachi,AFTERNOON,group B,Female,64.9,75.1,8500.0
3,MS04,Hadeed,20,Lahore,MOR,group A,Male,82.0,84.3,4000.0
4,MS05,Zara,40,Peshawer,AFT,group D,Female,65.9,72.8,3500.0


In [120]:
df.gender.str.contains('Male')

0      True
1      True
2     False
3      True
4     False
5     False
6     False
7      True
8      True
9      True
10     True
11     True
12     True
13    False
14    False
15     True
Name: gender, dtype: bool

In [121]:
df.gender.str.contains('Male').astype(int)

0     1
1     1
2     0
3     1
4     0
5     0
6     0
7     1
8     1
9     1
10    1
11    1
12    1
13    0
14    0
15    1
Name: gender, dtype: int64

In [122]:
df['gender'] = df.gender.str.contains('Male').astype(int)

In [123]:
df

Unnamed: 0,roll no,name,age,address,session,group,gender,subj1,subj2,scholarship
0,MS01,Rauf,52,Lahore,MORNING,group C,1,78.3,84.4,5000.0
1,MS02,Arif,51,Islamabad,AFT,group A,1,70.5,60.5,6000.0
2,MS03,Shaista,35,Karachi,AFTERNOON,group B,0,64.9,75.1,8500.0
3,MS04,Hadeed,20,Lahore,MOR,group A,1,82.0,84.3,4000.0
4,MS05,Zara,40,Peshawer,AFT,group D,0,65.9,72.8,3500.0
5,MS06,Mohid,16,Lahore,MORNING,group C,0,69.3,78.6,
6,MS07,Zobia,40,Sialkot,AFT,group B,0,90.2,,4000.0
7,MS08,Idrees,51,Multan,MORNING,group D,1,84.1,76.0,8000.0
8,MS09,Jamil,53,Karachi,AFT,group C,1,90.5,81.3,3500.0
9,MS10,Shahid,38,Lahore,AFTERNOON,group D,1,90.5,81.3,3800.0


In [124]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   roll no      16 non-null     object 
 1   name         16 non-null     object 
 2   age          16 non-null     int64  
 3   address      16 non-null     object 
 4   session      16 non-null     object 
 5   group        16 non-null     object 
 6   gender       16 non-null     int64  
 7   subj1        15 non-null     float64
 8   subj2        15 non-null     float64
 9   scholarship  14 non-null     float64
dtypes: float64(3), int64(2), object(5)
memory usage: 1.4+ KB
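
`str.contains()` does substring matching, which works above only because the match is case-sensitive and `'Female'` does not contain `'Male'`. A more defensive sketch, on a made-up Series with inconsistent casing, compares the whole value instead:

```python
import pandas as pd

gender = pd.Series(['Male', 'Female', 'male', 'Female'])

# Substring search: the lowercase pattern also matches 'Female'
substring = gender.str.contains('male')

# Exact, case-insensitive comparison followed by a cast to 0/1
exact = gender.str.lower() == 'male'
flags = exact.astype(int)

print(substring.tolist())   # [False, True, True, True]
print(flags.tolist())       # [1, 0, 1, 0]
```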


## 9. Sorting dataframes using `df.sort_values()`

>A pandas dataframe has two useful sorting methods: **`df.sort_values()`** sorts by the values of one or more columns, and **`df.sort_index()`** sorts by the index. Each of these methods comes with numerous options, such as sorting in a specific order (ascending or descending), sorting in place, positioning missing values, and choosing the sorting algorithm.
- The `df.sort_values()` method sorts by the values along either axis.
```
df.sort_values(by,axis=0,ascending=True,inplace=False,kind='quicksort',na_position='last',ignore_index=False)
```
Where,
- `by`: Name (str) or list of names to sort by.
- `axis`: If `axis` is 0 or 'index' then `by` may contain index levels and/or column labels. If `axis` is 1 or 'columns' then `by` may contain column levels and/or index labels.
- `ascending`: If True, sort in ascending order; if False, in descending order. A list of booleans gives one order per entry in `by`.
- `inplace`: If True, perform the operation in place instead of returning a new dataframe.
- `kind`: {'quicksort', 'mergesort', 'heapsort', 'stable'}, default 'quicksort'. This option is only applied when sorting on a single column or label.
- `na_position`: If 'first', puts NaNs at the beginning. Default is 'last'.
- `ignore_index`: If True, the resulting axis will be labeled 0, 1, …, n - 1. Default is False.
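
As a minimal sketch of how some of these parameters combine, the cell below sorts a small, hypothetical dataframe (`demo` is not part of the notebook's data) in descending order and relabels the rows with `ignore_index=True`.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the parameters listed above
demo = pd.DataFrame({
    'name':  ['Kamal', 'Saima', 'Jamal'],
    'marks': [21, 23, 12],
})

# Highest marks first; ignore_index=True relabels the rows 0, 1, ..., n-1
by_marks = demo.sort_values(by='marks', ascending=False, ignore_index=True)
print(by_marks)

# inplace=False (the default) returned a new dataframe, so `demo` is unchanged
print(demo)
```

Because `inplace` is left at its default, the result has to be assigned to a new variable to be kept.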

In [125]:
import pandas as pd
df = pd.DataFrame({
    'roll_no': [ 102, 101, 104, 103, 105],
    'name' : ['Kamal', 'Saima', 'Jamal', 'Shaikh', 'Farzana'],
    'gender' : ['M', 'F', 'M', 'M', 'F'],
    'grade'  : ['A', 'A', 'B', 'B', 'A'],
    'marks'  : [ 21,  23,  12,  14,  20],
    'city' : ['Lahore', 'Peshawer', 'Lahore', 'Karachi', 'Peshawer']
})
df

Unnamed: 0,roll_no,name,gender,grade,marks,city
0,102,Kamal,M,A,21,Lahore
1,101,Saima,F,A,23,Peshawer
2,104,Jamal,M,B,12,Lahore
3,103,Shaikh,M,B,14,Karachi
4,105,Farzana,F,A,20,Peshawer


### a. Sorting by Single Column

In [126]:
# Let us sort the data by grade column
# By default the sorting is done in ascending order and is not inplace
df1 = df.sort_values(by=['grade'])
df1

Unnamed: 0,roll_no,name,gender,grade,marks,city
0,102,Kamal,M,A,21,Lahore
1,101,Saima,F,A,23,Peshawer
4,105,Farzana,F,A,20,Peshawer
2,104,Jamal,M,B,12,Lahore
3,103,Shaikh,M,B,14,Karachi


- In the output above, the data is sorted on the `grade` column only. Within a grade the `marks` column plays no part, so a student with higher marks can appear below one with lower marks.
- We want to sort the data based on both grade and marks.
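
A related point, sketched here under the assumption that the order of tied rows matters to you: the default `kind='quicksort'` does not promise to keep tied rows in their original order, whereas `kind='stable'` does. The dataframe `students` below is a trimmed copy of the sample data, defined again so that the cell runs on its own.

```python
import pandas as pd

students = pd.DataFrame({
    'name':  ['Kamal', 'Saima', 'Jamal', 'Shaikh', 'Farzana'],
    'grade': ['A', 'A', 'B', 'B', 'A'],
    'marks': [21, 23, 12, 14, 20],
})

# A stable sort keeps rows with the same grade in their original relative order
stable_sorted = students.sort_values(by='grade', kind='stable')
print(stable_sorted)
```

The row labels of the result show the original positions, so you can check that the rows within each grade were not reordered.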

### b. Sorting by Multiple Columns

In [127]:
# Sort by grade first, then by marks within each grade (both ascending by default)
df2 = df.sort_values(by=['grade','marks'])
df2

Unnamed: 0,roll_no,name,gender,grade,marks,city
4,105,Farzana,F,A,20,Peshawer
0,102,Kamal,M,A,21,Lahore
1,101,Saima,F,A,23,Peshawer
2,104,Jamal,M,B,12,Lahore
3,103,Shaikh,M,B,14,Karachi


- Note that the data is first sorted by grade, and then within each grade it is sorted by marks.
- The problem is still not solved, though: both columns are sorted in ascending order, while we actually want grade in ascending order and marks in descending order.


In [128]:
df3 = df.sort_values(by=['grade','marks'], ascending=[True,False])
df3

Unnamed: 0,roll_no,name,gender,grade,marks,city
1,101,Saima,F,A,23,Peshawer
0,102,Kamal,M,A,21,Lahore
4,105,Farzana,F,A,20,Peshawer
3,103,Shaikh,M,B,14,Karachi
2,104,Jamal,M,B,12,Lahore


### c. Reset the Index (if you want)
- After you sort a dataframe, each row keeps its original index label, so the index no longer runs in order. To renumber the rows, use the `reset_index()` method.


In [129]:
df3.reset_index()

Unnamed: 0,index,roll_no,name,gender,grade,marks,city
0,1,101,Saima,F,A,23,Peshawer
1,0,102,Kamal,M,A,21,Lahore
2,4,105,Farzana,F,A,20,Peshawer
3,3,103,Shaikh,M,B,14,Karachi
4,2,104,Jamal,M,B,12,Lahore


- Observe that it has created a new column named `index`, which holds the previous index.
- To discard the old index instead, pass `drop=True`; pass `inplace=True` as well to modify `df3` itself rather than return a new dataframe.

In [130]:
df3.reset_index(inplace=True, drop=True)

In [131]:
df3

Unnamed: 0,roll_no,name,gender,grade,marks,city
0,101,Saima,F,A,23,Peshawer
1,102,Kamal,M,A,21,Lahore
2,105,Farzana,F,A,20,Peshawer
3,103,Shaikh,M,B,14,Karachi
4,104,Jamal,M,B,12,Lahore
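
If you would rather avoid `inplace=True`, one alternative (a sketch, not the only way) is to chain `reset_index(drop=True)` directly onto the sort and assign the result. The dataframe `students` is a trimmed copy of the sample data so that the cell is self-contained.

```python
import pandas as pd

students = pd.DataFrame({
    'name':  ['Kamal', 'Saima', 'Jamal', 'Shaikh', 'Farzana'],
    'grade': ['A', 'A', 'B', 'B', 'A'],
    'marks': [21, 23, 12, 14, 20],
})

# Sort, then renumber the rows in a single expression; drop=True discards the old index
ranked = (
    students
    .sort_values(by=['grade', 'marks'], ascending=[True, False])
    .reset_index(drop=True)
)
print(ranked)
```

Passing `ignore_index=True` to `sort_values()` should give the same row labels without the extra call.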


### d. Role of NaN Values in Sorting

In [132]:
df = pd.read_csv('datasets/groupdata.csv')
df

Unnamed: 0,roll no,name,age,address,session,group,gender,subj1,subj2,scholarship
0,MS01,Rauf,52,Lahore,MORNING,group C,Male,78.3,84.4,5000.0
1,MS02,Arif,51,Islamabad,AFT,group A,Male,70.5,60.5,6000.0
2,MS03,Shaista,35,Karachi,AFTERNOON,group B,Female,64.9,75.1,8500.0
3,MS04,Hadeed,20,Lahore,MOR,group A,Male,82.0,84.3,4000.0
4,MS05,Zara,40,Peshawer,AFT,group D,Female,65.9,72.8,3500.0
5,MS06,Mohid,16,Lahore,MORNING,group C,Female,69.3,78.6,
6,MS07,Zobia,40,Sialkot,AFT,group B,Female,90.2,,4000.0
7,MS08,Idrees,51,Multan,MORNING,group D,Male,84.1,76.0,8000.0
8,MS09,Jamil,53,Karachi,AFT,group C,Male,90.5,81.3,3500.0
9,MS10,Shahid,38,Lahore,AFTERNOON,group D,Male,90.5,81.3,3800.0


In [133]:
# If there is a missing value NaN, by default it is listed at the end when using sort_values function
# Regardless of the sorting order (Ascending or Descending)
df.sort_values(by='scholarship')

Unnamed: 0,roll no,name,age,address,session,group,gender,subj1,subj2,scholarship
4,MS05,Zara,40,Peshawer,AFT,group D,Female,65.9,72.8,3500.0
8,MS09,Jamil,53,Karachi,AFT,group C,Male,90.5,81.3,3500.0
14,MS15,Fatima,33,Sialkot,AFT,group C,Female,90.5,81.3,3500.0
9,MS10,Shahid,38,Lahore,AFTERNOON,group D,Male,90.5,81.3,3800.0
15,MS16,Kakamanna,42,Multan,AFTERNOON,group A,Male,90.5,81.3,3800.0
3,MS04,Hadeed,20,Lahore,MOR,group A,Male,82.0,84.3,4000.0
6,MS07,Zobia,40,Sialkot,AFT,group B,Female,90.2,,4000.0
0,MS01,Rauf,52,Lahore,MORNING,group C,Male,78.3,84.4,5000.0
1,MS02,Arif,51,Islamabad,AFT,group A,Male,70.5,60.5,6000.0
10,MS11,Khurram,35,Islamabad,MOR,group B,Male,90.5,81.3,6000.0


In [134]:
# With na_position='first', the rows with NaN are listed at the top instead
df.sort_values(by=['scholarship'], na_position='first')

Unnamed: 0,roll no,name,age,address,session,group,gender,subj1,subj2,scholarship
5,MS06,Mohid,16,Lahore,MORNING,group C,Female,69.3,78.6,
11,MS12,Maaz,25,Karachi,AFTERNOON,group C,Male,90.5,81.3,
4,MS05,Zara,40,Peshawer,AFT,group D,Female,65.9,72.8,3500.0
8,MS09,Jamil,53,Karachi,AFT,group C,Male,90.5,81.3,3500.0
14,MS15,Fatima,33,Sialkot,AFT,group C,Female,90.5,81.3,3500.0
9,MS10,Shahid,38,Lahore,AFTERNOON,group D,Male,90.5,81.3,3800.0
15,MS16,Kakamanna,42,Multan,AFTERNOON,group A,Male,90.5,81.3,3800.0
3,MS04,Hadeed,20,Lahore,MOR,group A,Male,82.0,84.3,4000.0
6,MS07,Zobia,40,Sialkot,AFT,group B,Female,90.2,,4000.0
0,MS01,Rauf,52,Lahore,MORNING,group C,Male,78.3,84.4,5000.0
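
The behaviour of NaN values is easier to see on a tiny example. The sketch below uses a hypothetical three-row dataframe (`awards`) with one missing scholarship and sorts it three ways.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing scholarship value
awards = pd.DataFrame({
    'name': ['Rauf', 'Mohid', 'Zara'],
    'scholarship': [5000.0, np.nan, 3500.0],
})

# NaN goes last by default, whether the order is ascending or descending
asc = awards.sort_values(by='scholarship')
desc = awards.sort_values(by='scholarship', ascending=False)

# na_position='first' moves the NaN row to the top, independent of `ascending`
nan_first = awards.sort_values(by='scholarship', ascending=False, na_position='first')

print(asc['name'].tolist())
print(desc['name'].tolist())
print(nan_first['name'].tolist())
```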


>- Check out the `df.nlargest()` method, which returns the first `n` rows ordered by `columns` in descending order.
>- Check out the `df.nsmallest()` method, which returns the first `n` rows ordered by `columns` in ascending order.

In [136]:
df.nsmallest(3, 'scholarship')

Unnamed: 0,roll no,name,age,address,session,group,gender,subj1,subj2,scholarship
4,MS05,Zara,40,Peshawer,AFT,group D,Female,65.9,72.8,3500.0
8,MS09,Jamil,53,Karachi,AFT,group C,Male,90.5,81.3,3500.0
14,MS15,Fatima,33,Sialkot,AFT,group C,Female,90.5,81.3,3500.0
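
Both methods also accept a `keep` argument that decides what happens to ties. The sketch below, on a hypothetical four-row dataframe (`awards`), compares the default `keep='first'` with `keep='all'`.

```python
import pandas as pd

# Hypothetical toy data in which two students share the lowest scholarship
awards = pd.DataFrame({
    'name': ['Zara', 'Jamil', 'Hadeed', 'Rauf'],
    'scholarship': [3500.0, 3500.0, 4000.0, 5000.0],
})

# The two largest scholarships, largest first
top2 = awards.nlargest(2, 'scholarship')

# keep='first' (the default) returns exactly n rows, taking the first of the tied rows
lowest = awards.nsmallest(1, 'scholarship')

# keep='all' returns every row tied with the n-th value, even if that exceeds n
lowest_all = awards.nsmallest(1, 'scholarship', keep='all')

print(top2)
print(lowest)
print(lowest_all)
```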


## 10. Sorting dataframes using `df.sort_index()`
> As we have seen, `df.sort_values()` by default sorts the rows (`axis=0`), ordering them by the values in one or more columns. Setting its `axis` argument to 1 instead reorders the columns by the values in a given row. However, this can fail when that row mixes numbers and strings.

- To order a dataframe by its labels rather than its values (row labels with `axis=0`, column labels with `axis=1`), we normally use the **`df.sort_index()`** method.
```
df.sort_index(axis=0,ascending=True,inplace=False,kind='quicksort',na_position='last',ignore_index=False)
```
Where,
- `axis`: The axis along which to sort. The value 0 sorts by the row labels (the index), and 1 sorts by the column labels. Default is 0.
- `ascending`: If True, sort in ascending order; if False, in descending order.
- `inplace`: If True, perform the operation in place instead of returning a new dataframe.
- `kind`: {'quicksort', 'mergesort', 'heapsort', 'stable'}, default 'quicksort'. This option is only applied when sorting on a single column or label.
- `na_position`: If 'first', puts NaNs at the beginning. Default is 'last'.
- `ignore_index`: If True, the resulting axis will be labeled 0, 1, …, n - 1. Default is False.
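
The difference between sorting columns by value and sorting them by label can be sketched as follows. The dataframes `scores` and `mixed` are hypothetical; the last step is expected to raise a `TypeError` because a string and an integer cannot be compared.

```python
import pandas as pd

# Hypothetical all-numeric frame, so the values within a row can be compared
scores = pd.DataFrame({
    'subj1': [78.3, 70.5],
    'subj2': [84.4, 60.5],
    'subj3': [65.0, 90.0],
})

# sort_values with axis=1: reorder the columns by the VALUES in the row labelled 0
by_row0 = scores.sort_values(by=0, axis=1)
print(by_row0.columns.tolist())

# sort_index with axis=1: reorder the columns by their LABELS, ignoring the values
by_label = by_row0.sort_index(axis=1)
print(by_label.columns.tolist())

# A row that mixes a string and a number cannot be ordered by value
mixed = pd.DataFrame({'name': ['Kamal'], 'marks': [21]})
try:
    mixed.sort_values(by=0, axis=1)
except TypeError as err:
    print('Sorting a mixed row failed:', err)
```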

In [137]:
# Let us create a simple data frame
import pandas as pd
df = pd.DataFrame({
    'roll_no': [ 102, 101, 104, 105, 103],
    'name' : ['Kamal', 'Saima', 'Jamal','Farzana', 'Shaikh'],
    'gender' : ['M', 'F', 'M', 'M', 'F'],
    'grade'  : ['A', 'A', 'B', 'B', 'A'],
    'marks'  : [ 21,  23,  12,  14,  20],
    'city' : ['Lahore', 'Peshawer', 'Lahore', 'Karachi', 'Peshawer']
})
df

Unnamed: 0,roll_no,name,gender,grade,marks,city
0,102,Kamal,M,A,21,Lahore
1,101,Saima,F,A,23,Peshawer
2,104,Jamal,M,B,12,Lahore
3,105,Farzana,M,B,14,Karachi
4,103,Shaikh,F,A,20,Peshawer


### a. Sort by Column Labels
- Passing `axis=1` sorts the dataframe by its column labels. With the default, `axis=0`, the sorting is done on the row labels (the index).

In [138]:
df1 = df.sort_index(axis=1)
df1

Unnamed: 0,city,gender,grade,marks,name,roll_no
0,Lahore,M,A,21,Kamal,102
1,Peshawer,F,A,23,Saima,101
2,Lahore,M,B,12,Jamal,104
3,Karachi,M,B,14,Farzana,105
4,Peshawer,F,A,20,Shaikh,103
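
Column labels can be sorted in descending order as well. A minimal sketch on a hypothetical two-row dataframe (`students`), whose columns are deliberately defined out of alphabetical order:

```python
import pandas as pd

students = pd.DataFrame({
    'name':    ['Kamal', 'Saima'],
    'city':    ['Lahore', 'Peshawer'],
    'roll_no': [102, 101],
    'marks':   [21, 23],
})

# Column labels in alphabetical order
asc_cols = students.sort_index(axis=1)

# Column labels in reverse alphabetical order
desc_cols = students.sort_index(axis=1, ascending=False)

print(asc_cols.columns.tolist())
print(desc_cols.columns.tolist())
```

Only the order of the columns changes; the values inside each column, and the order of the rows, stay as they were.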


### b. Sort by Index
- The first question that might come to mind is why we would need to sort by index at all. In the dataframe above the row indices are already in numeric order, and if they get shuffled (for example, after sorting by the values of some column) we can call the `reset_index()` method to renumber them. Sorting by index becomes useful when the index carries meaningful labels, such as a roll number.
<br><br>
- To understand this, let us follow these three steps:
    - Set the `roll_no` column as the index
    - Call `sort_index()` with `axis=0`
    - Call `reset_index()`

In [139]:
df

Unnamed: 0,roll_no,name,gender,grade,marks,city
0,102,Kamal,M,A,21,Lahore
1,101,Saima,F,A,23,Peshawer
2,104,Jamal,M,B,12,Lahore
3,105,Farzana,M,B,14,Karachi
4,103,Shaikh,F,A,20,Peshawer


**Let us sort by roll_no**

In [140]:
# Let us set the roll_no column as the index
df1 = df.set_index(["roll_no"])
df1

roll_no,name,gender,grade,marks,city
102,Kamal,M,A,21,Lahore
101,Saima,F,A,23,Peshawer
104,Jamal,M,B,12,Lahore
105,Farzana,M,B,14,Karachi
103,Shaikh,F,A,20,Peshawer


>Note that the implicit integer index is dropped and the `roll_no` column has become the index of this dataframe.

In [141]:
# Sort the dataframe by its index (roll_no)
df2 = df1.sort_index(axis=0)
df2

roll_no,name,gender,grade,marks,city
101,Saima,F,A,23,Peshawer
102,Kamal,M,A,21,Lahore
103,Shaikh,F,A,20,Peshawer
104,Jamal,M,B,12,Lahore
105,Farzana,M,B,14,Karachi


In [142]:
# After sorting, reset the index if you want; drop=False (the default) keeps roll_no as a regular column
df3 = df2.reset_index(drop=False)
df3

Unnamed: 0,roll_no,name,gender,grade,marks,city
0,101,Saima,F,A,23,Peshawer
1,102,Kamal,M,A,21,Lahore
2,103,Shaikh,F,A,20,Peshawer
3,104,Jamal,M,B,12,Lahore
4,105,Farzana,M,B,14,Karachi
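
The three steps above can also be written as a single chained expression. The sketch below does that on a trimmed, self-contained copy of the sample data (`students`) and checks that the result matches a plain `sort_values()` on `roll_no`, which is an equivalent shortcut in this particular case.

```python
import pandas as pd

students = pd.DataFrame({
    'roll_no': [102, 101, 104, 105, 103],
    'name':    ['Kamal', 'Saima', 'Jamal', 'Farzana', 'Shaikh'],
    'marks':   [21, 23, 12, 14, 20],
})

# Step 1: roll_no becomes the index; step 2: sort by it; step 3: restore it as a column
via_index = students.set_index('roll_no').sort_index().reset_index()

# Shortcut: sort by the column's values and relabel the rows 0, 1, ..., n-1
via_values = students.sort_values(by='roll_no', ignore_index=True)

print(via_index)
print(via_index.equals(via_values))
```

The index-based route is mainly worth it when you intend to keep `roll_no` as the index afterwards, for example to look rows up by roll number.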


## Check Your Concepts:
- What is Pandas?

# Pandas - Assignment no 07
- Here is link of [Pandas - Assignment no 07]()