## **PANDAS: INTRODUCTION**
> It is often said that 80% of data analysis is spent cleaning and preparing data. This section covers a small but important part of that work: data manipulation and cleaning with Pandas.
### **Data Structures**
**Pandas provides two core data structures:**<br>
* **Series -** A one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.

* **DataFrame -** A two-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet, a SQL table, or a dict of Series objects.
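
As a minimal sketch of the two structures, the snippet below builds a Series and a DataFrame and shows that selecting a single DataFrame column yields a Series. The names `prices` and `inventory` and their values are illustrative assumptions, not part of this notebook.

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels
prices = pd.Series([1.5, 0.25, 3.0], index=['apple', 'banana', 'cherry'])

# A DataFrame: several columns of possibly different dtypes sharing one index
inventory = pd.DataFrame(
    {'price': [1.5, 0.25, 3.0], 'in_stock': [True, False, True]},
    index=['apple', 'banana', 'cherry'],
)

print(type(prices).__name__)               # Series
print(type(inventory).__name__)            # DataFrame
print(type(inventory['price']).__name__)   # Series: each column is itself a Series
print(inventory.dtypes)                    # one dtype per column
```

Each column keeps its own dtype, while all columns share the same row index.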

### **Series Data Structure:**
**pandas.Series(data, index, dtype, copy)**<br>
* **data -** the values; accepts forms such as an ndarray, list, scalar constant, or dictionary.<br>
* **index -** the axis labels. Values must be hashable and match the length of data; they do not have to be unique. If omitted, it defaults to 0, 1, ..., n-1.<br>
* **dtype -** the data type of the values; it is inferred from data when not given.<br>
* **copy -** whether to copy the input data; its default value is False. It only affects Series or one-dimensional ndarray inputs.

In [1]:
# importing required modules
import pandas as pd
import numpy as np

In [7]:
# creating empty Series
import warnings
warnings.filterwarnings('ignore')
s = pd.Series(dtype = 'float64')  # explicit dtype: the default for an empty Series differs between pandas versions
print (s, type(s))

Series([], dtype: float64) <class 'pandas.core.series.Series'>


In [4]:
# create a Series from a ndarray
arr_data = np.array(['apple', 'banana', 'cherry', 'pineapple'])
s = pd.Series(data = arr_data)
print (s, type(s), s[0], s[3])

0        apple
1       banana
2       cherry
3    pineapple
dtype: object <class 'pandas.core.series.Series'> apple pineapple


In [12]:
arr_data = np.array([100, 300, 200, 600, 500])
s = pd.Series(arr_data, copy = False)
s[0] = 999; arr_data[2] = 888
print (arr_data, type(arr_data), "\n", s, type(s))

arr_data = np.array([100, 300, 200, 600, 500])
s = pd.Series(arr_data, copy = True)
s[0] = 999; arr_data[2] = 888
print (arr_data, type(arr_data), "\n", s, type(s))

[999 300 888 600 500] <class 'numpy.ndarray'> 
 0    999
1    300
2    888
3    600
4    500
dtype: int32 <class 'pandas.core.series.Series'>
[100 300 888 600 500] <class 'numpy.ndarray'> 
 0    999
1    300
2    200
3    600
4    500
dtype: int32 <class 'pandas.core.series.Series'>
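
The effect of `copy` above can also be checked without mutating anything. The sketch below uses `np.shares_memory` to test whether the Series and the source array use the same buffer; the variable names `shared` and `separate` are illustrative. This assumes a one-dimensional numeric ndarray input, and newer pandas versions with Copy-on-Write may behave differently when the data is later modified.

```python
import numpy as np
import pandas as pd

arr_data = np.array([100, 300, 200, 600, 500])

shared = pd.Series(arr_data, copy=False)    # allowed to reuse the array's buffer
separate = pd.Series(arr_data, copy=True)   # always gets its own buffer

# np.shares_memory reports whether two arrays overlap in memory
print(np.shares_memory(arr_data, shared.to_numpy()))
print(np.shares_memory(arr_data, separate.to_numpy()))
```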


In [15]:
arr_data = np.array(['apple', 'banana', 'cherry', 'pineapple'])
print (arr_data, type(arr_data))

s = pd.Series(data = arr_data, index = [100, 101, 102, 103])
print (s)
print (s[100], type(s[100]), s[103], type(s[103]))

['apple' 'banana' 'cherry' 'pineapple'] <class 'numpy.ndarray'>
100        apple
101       banana
102       cherry
103    pineapple
dtype: object
apple <class 'str'> pineapple <class 'str'>


In [19]:
s = pd.Series(data = arr_data, index = [100, 101, 100, 103])
print (s)
print (s[100], type(s[100]), s[103], type(s[103]))

100        apple
101       banana
100       cherry
103    pineapple
dtype: object
100     apple
100    cherry
dtype: object <class 'pandas.core.series.Series'> pineapple <class 'str'>
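
The cell above shows that a duplicated label returns a Series rather than a single value. A small sketch of how such duplicates could be detected before indexing, using the same data (the names `dup` and `one` are illustrative):

```python
import pandas as pd

s = pd.Series(['apple', 'banana', 'cherry', 'pineapple'], index=[100, 101, 100, 103])

print(s.index.is_unique)               # False: label 100 appears twice
print(s.index.duplicated().tolist())   # flags each repeat after the first occurrence

dup = s.loc[100]   # duplicated label: a Series with all matching rows
one = s.loc[103]   # unique label: a single scalar value
print(type(dup).__name__, type(one).__name__)
```

Checking `is_unique` first is one way to avoid code that silently receives a Series where it expected a scalar.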


In [17]:
arr_data = np.array(['apple', 'banana', 'cherry', 'pineapple'])
s = pd.Series(data = arr_data, index = ['fruit-1', 'fruit-2', 'fruit-3', 'fruit-4'])
print (s)
print (s['fruit-1'], s.iloc[0], s['fruit-3'], s.iloc[2])  # .iloc for positions; s[0] on a non-integer index is deprecated

fruit-1        apple
fruit-2       banana
fruit-3       cherry
fruit-4    pineapple
dtype: object
apple apple cherry cherry
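
Label-based and position-based access can be made explicit with `.loc` and `.iloc`. The sketch below reuses the fruit Series and also shows that the two accessors treat slice endpoints differently; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series(['apple', 'banana', 'cherry', 'pineapple'],
              index=['fruit-1', 'fruit-2', 'fruit-3', 'fruit-4'])

by_label = s.loc['fruit-3']                 # look up by index label
by_position = s.iloc[2]                     # look up by 0-based integer position
label_slice = s.loc['fruit-1':'fruit-2']    # label slices include the end label
position_slice = s.iloc[0:2]                # position slices exclude the end

print(by_label, by_position)
print(label_slice.tolist(), position_slice.tolist())
```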


In [21]:
# create a Series from a dictionary
dict_data = {'apple':100, 'banana':202, 'coconut':450, 'mango':435}
s = pd.Series(dict_data)
print (s)
s = pd.Series(dict_data, index = ['banana', 'mango', 'apple', 'coconut'])
print (s)

apple      100
banana     202
coconut    450
mango      435
dtype: int64
banana     202
mango      435
apple      100
coconut    450
dtype: int64


In [24]:
dict_data = {'apple':100, 'banana':202, 'coconut':450, 'mango':435}
s = pd.Series(dict_data, index = ['banana', 'mango', 'apple', 'coconut'])
print (s)
s = pd.Series(data = dict_data, index = 
                    ['banana', 'lime', 'coconut', 'mango', 'guava', 'apple', 'mango', 'apple', 'coconut'])
print (s)
print (s['banana'], s['lime'], s.iloc[4], s.iloc[5])  # .iloc for positions 4 and 5

banana     202
mango      435
apple      100
coconut    450
dtype: int64
banana     202.0
lime         NaN
coconut    450.0
mango      435.0
guava        NaN
apple      100.0
mango      435.0
apple      100.0
coconut    450.0
dtype: float64
202.0 nan nan 100.0
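
In the output above, index labels with no matching dictionary key become NaN, and the dtype changes from int64 to float64 because NaN is a floating point value. A minimal sketch of inspecting and cleaning such a Series (the shortened index and the name `clean` are illustrative):

```python
import pandas as pd

dict_data = {'apple': 100, 'banana': 202, 'coconut': 450, 'mango': 435}
s = pd.Series(dict_data, index=['banana', 'lime', 'apple'])

print(s.dtype)           # float64, since 'lime' has no value
print(s.isna().sum())    # number of labels without a matching key

# drop the missing entries, then convert back to integers
clean = s.dropna().astype('int64')
print(clean)
```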


In [25]:
# create a Series from a scalar
s = pd.Series(5, index = [0, 1, 2, 3, 4])
print(s)
s = pd.Series(5, index = [0, 1, 2, 0, 1, 2])
print(s)

0    5
1    5
2    5
3    5
4    5
dtype: int64
0    5
1    5
2    5
0    5
1    5
2    5
dtype: int64


In [33]:
# Create a Series from a list
s = pd.Series(data = [101, 303, 202, 404, 505], index = ['red', 'blue', 'brown', 'black', 'silver'])
print (s)

red       101
blue      303
brown     202
black     404
silver    505
dtype: int64


### **DataFrame Data Structure:**

#### Create DataFrame

In [7]:
data_dict = {'emp_name':['Amal', 'Kamal', 'Bimal', 'Shyamal'], 'emp_age':[34, 35, 45, 43]}
df = pd.DataFrame(data_dict)
df

Unnamed: 0,emp_name,emp_age
0,Amal,34
1,Kamal,35
2,Bimal,45
3,Shyamal,43


In [8]:
data_dict = {'emp_name':['Amal', 'Kamal', 'Bimal', 'Shyamal'], 'emp_age':[34, 35, 45, 43]}
df = pd.DataFrame(data = data_dict)
df

Unnamed: 0,emp_name,emp_age
0,Amal,34
1,Kamal,35
2,Bimal,45
3,Shyamal,43


In [17]:
data_dict = {'emp_name':['Amal', 'Kamal', 'Bimal', 'Shyamal'], 'emp_age':[34, 35, 45, 43]}
emp_id = [100, 101, 102, 103]
df = pd.DataFrame(data = data_dict, index = emp_id)
df

Unnamed: 0,emp_name,emp_age
100,Amal,34
101,Kamal,35
102,Bimal,45
103,Shyamal,43


In [18]:
data_dict = {'emp_name':['Amal', 'Kamal', 'Bimal', 'Shyamal'], 'emp_age':[34, 35, 45, 43]}
emp_id = [100, 101, 102, 103]
df = pd.DataFrame(data = data_dict, index = emp_id)
print (df)
df = df.reset_index()
df

    emp_name  emp_age
100     Amal       34
101    Kamal       35
102    Bimal       45
103  Shyamal       43


Unnamed: 0,index,emp_name,emp_age
0,100,Amal,34
1,101,Kamal,35
2,102,Bimal,45
3,103,Shyamal,43
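
`reset_index()` above moves the old index into a column named `index`. The sketch below shows a few related options under the same employee data: naming the index first so the new column gets a meaningful name, discarding the index, and moving a column back into the index. The label `emp_id` is taken from the variable name used above; the other names are illustrative.

```python
import pandas as pd

data_dict = {'emp_name': ['Amal', 'Kamal', 'Bimal', 'Shyamal'], 'emp_age': [34, 35, 45, 43]}
df = pd.DataFrame(data_dict, index=[100, 101, 102, 103])

named = df.rename_axis('emp_id').reset_index()   # index becomes a column called emp_id
dropped = df.reset_index(drop=True)              # old index is discarded entirely
restored = named.set_index('emp_id')             # move the column back into the index

print(named.columns.tolist())
print(dropped.index.tolist())
print(restored.index.tolist())
```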


In [19]:
user_data = [['alice', 19, 'F', 'student'], ['john', 26, 'M', 'student']]  # list of lists
user1 = pd.DataFrame(user_data)
user1

Unnamed: 0,0,1,2,3
0,alice,19,F,student
1,john,26,M,student


In [18]:
user_data = [['alice', 19, 'F', 'student'], ['john', 26, 'M', 'student']]  # list of lists
user_columns = ['name', 'age', 'gender', 'job']
user1 = pd.DataFrame(data = user_data, columns = user_columns)
user1

Unnamed: 0,name,age,gender,job
0,alice,19,F,student
1,john,26,M,student


In [19]:
user_data = dict(name = ['eric', 'paul'], age = [22, 58], gender = ['M', 'F'], job = ['student', 'manager'])
print (user_data)
user2 = pd.DataFrame(data = user_data)
user2

{'name': ['eric', 'paul'], 'age': [22, 58], 'gender': ['M', 'F'], 'job': ['student', 'manager']}


Unnamed: 0,name,age,gender,job
0,eric,22,M,student
1,paul,58,F,manager


In [4]:
user_data = {'name': ['peter', 'julie'], 'age': [33, 44], 'gender': ['M', 'F'], 'job': ['engineer', 'scientist']}
user3 = pd.DataFrame(data = user_data)
user3

Unnamed: 0,name,age,gender,job
0,peter,33,M,engineer
1,julie,44,F,scientist


#### Concatenate DataFrame

In [20]:
print (user1)
print ()
print (user2)
print ()
print (user3)

    name  age gender      job
0  alice   19      F  student
1   john   26      M  student

   name  age gender      job
0  eric   22      M  student
1  paul   58      F  manager

    name  age gender        job
0  peter   33      M   engineer
1  julie   44      F  scientist


In [8]:
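# Note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0;
# on newer versions use pd.concat([user1, user2]) instead.
users = user1.append(user2)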
users

Unnamed: 0,name,age,gender,job
0,alice,19,F,student
1,john,26,M,student
0,eric,22,M,student
1,paul,58,F,manager


In [9]:
users = user1.append(user2, ignore_index = True)
users

Unnamed: 0,name,age,gender,job
0,alice,19,F,student
1,john,26,M,student
2,eric,22,M,student
3,paul,58,F,manager


In [10]:
users = users.append(user3, ignore_index = True)
users

Unnamed: 0,name,age,gender,job
0,alice,19,F,student
1,john,26,M,student
2,eric,22,M,student
3,paul,58,F,manager
4,peter,33,M,engineer
5,julie,44,F,scientist


In [11]:
users = user1.append(user2).append(user3, ignore_index = True)
users

Unnamed: 0,name,age,gender,job
0,alice,19,F,student
1,john,26,M,student
2,eric,22,M,student
3,paul,58,F,manager
4,peter,33,M,engineer
5,julie,44,F,scientist


In [21]:
users = pd.concat([user1, user2, user3])
users

Unnamed: 0,name,age,gender,job
0,alice,19,F,student
1,john,26,M,student
0,eric,22,M,student
1,paul,58,F,manager
0,peter,33,M,engineer
1,julie,44,F,scientist


In [24]:
users = pd.concat([user1, user2, user3], ignore_index = True)
users

Unnamed: 0,name,age,gender,job
0,alice,19,F,student
1,john,26,M,student
2,eric,22,M,student
3,paul,58,F,manager
4,peter,33,M,engineer
5,julie,44,F,scientist
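
When the original row labels are discarded with `ignore_index = True`, the information about which frame a row came from is lost. One possible alternative is the `keys` argument of `pd.concat`, sketched below; the labels `batch1` and `batch2` are hypothetical names chosen for illustration.

```python
import pandas as pd

cols = ['name', 'age', 'gender', 'job']
user1 = pd.DataFrame([['alice', 19, 'F', 'student'], ['john', 26, 'M', 'student']], columns=cols)
user2 = pd.DataFrame([['eric', 22, 'M', 'student'], ['paul', 58, 'F', 'manager']], columns=cols)

# keys= adds an outer index level naming the source of each block of rows
tagged = pd.concat([user1, user2], keys=['batch1', 'batch2'])
print(tagged)

# recover the rows that came from user2
print(tagged.loc['batch2'])
```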


### DataFrame to NumPy ndarray

In [26]:
user1

Unnamed: 0,name,age,gender,job
0,alice,19,F,student
1,john,26,M,student


In [29]:
arr_data = user1[['name', 'job']].to_numpy()
print (arr_data, type(arr_data))

[['alice' 'student']
 ['john' 'student']] <class 'numpy.ndarray'>


In [30]:
arr_data = user1.to_numpy()
print (arr_data, type(arr_data), arr_data.ndim, arr_data.size)

[['alice' 19 'F' 'student']
 ['john' 26 'M' 'student']] <class 'numpy.ndarray'> 2 8


In [31]:
arr_data = np.array(user1)
print (arr_data, type(arr_data))

[['alice' 19 'F' 'student']
 ['john' 26 'M' 'student']] <class 'numpy.ndarray'>
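
The arrays printed above hold both strings and integers, so NumPy has to fall back to the generic `object` dtype. A short sketch of how the resulting dtype depends on the selected columns (a reduced two-column frame is assumed here for brevity):

```python
import pandas as pd

user1 = pd.DataFrame({'name': ['alice', 'john'], 'age': [19, 26]})

mixed = user1.to_numpy()              # strings and integers: common dtype is object
numeric = user1[['age']].to_numpy()   # a numeric-only selection keeps an integer dtype

print(mixed.dtype, mixed.shape)
print(numeric.dtype, numeric.shape)
```

Selecting only numeric columns before converting keeps the array suitable for fast numerical work.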


## Join DataFrame

In [32]:
dict_data = dict(name = ['alice', 'john', 'eric', 'julie', 'anderson'], height = [165, 180, 175, 171, 169])
user4 = pd.DataFrame(data = dict_data)
user4

Unnamed: 0,name,height
0,alice,165
1,john,180
2,eric,175
3,julie,171
4,anderson,169


In [33]:
users

Unnamed: 0,name,age,gender,job
0,alice,19,F,student
1,john,26,M,student
2,eric,22,M,student
3,paul,58,F,manager
4,peter,33,M,engineer
5,julie,44,F,scientist


In [34]:
# inner join: only rows whose "name" value appears in both data frames
merge_inner = pd.merge(users, user4, on = "name", how = "inner")
merge_inner

Unnamed: 0,name,age,gender,job,height
0,alice,19,F,student,165
1,john,26,M,student,180
2,eric,22,M,student,175
3,julie,44,F,scientist,171


In [36]:
merge_inner = pd.merge(users, user4, on = "name")  # by default inner join will take place
merge_inner

Unnamed: 0,name,age,gender,job,height
0,alice,19,F,student,165
1,john,26,M,student,180
2,eric,22,M,student,175
3,julie,44,F,scientist,171


In [37]:
# outer join: All rows from both data frames
merge_outer = pd.merge(users, user4, on = "name", how = "outer")
merge_outer

Unnamed: 0,name,age,gender,job,height
0,alice,19.0,F,student,165.0
1,john,26.0,M,student,180.0
2,eric,22.0,M,student,175.0
3,paul,58.0,F,manager,
4,peter,33.0,M,engineer,
5,julie,44.0,F,scientist,171.0
6,anderson,,,,169.0


In [38]:
# left outer join: All rows from the left data frame and matching rows from the right data frame
merge_left = pd.merge(users, user4, on = "name", how = "left")
merge_left

Unnamed: 0,name,age,gender,job,height
0,alice,19,F,student,165.0
1,john,26,M,student,180.0
2,eric,22,M,student,175.0
3,paul,58,F,manager,
4,peter,33,M,engineer,
5,julie,44,F,scientist,171.0


In [69]:
# right outer join: Only matching rows from the left data frame and all rows from the right data frame
merge_right = pd.merge(users, user4, on = "name", how = "right")
print (merge_right.size)
merge_right

25


Unnamed: 0,name,age,gender,job,height
0,alice,19.0,F,student,165
1,john,26.0,M,student,180
2,eric,22.0,M,student,175
3,julie,44.0,F,scientist,171
4,anderson,,,,169
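
To see which side each row of a join came from, `pd.merge` accepts `indicator=True`. The sketch below uses reduced versions of the two frames (the name `heights` and the shortened data are assumptions for illustration). Row order in an outer join can vary between pandas versions, so the example does not rely on it.

```python
import pandas as pd

users = pd.DataFrame({'name': ['alice', 'john', 'paul'], 'age': [19, 26, 58]})
heights = pd.DataFrame({'name': ['alice', 'john', 'anderson'], 'height': [165, 180, 169]})

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'
merged = pd.merge(users, heights, on='name', how='outer', indicator=True)
print(merged)

# rows present in only one of the two inputs
unmatched = merged[merged['_merge'] != 'both']
print(sorted(unmatched['name']))
```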


## Summarizing

In [40]:
print (type(users))
users

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,name,age,gender,job
0,alice,19,F,student
1,john,26,M,student
2,eric,22,M,student
3,paul,58,F,manager
4,peter,33,M,engineer
5,julie,44,F,scientist


In [41]:
users.head()

Unnamed: 0,name,age,gender,job
0,alice,19,F,student
1,john,26,M,student
2,eric,22,M,student
3,paul,58,F,manager
4,peter,33,M,engineer


In [42]:
users.head(3)

Unnamed: 0,name,age,gender,job
0,alice,19,F,student
1,john,26,M,student
2,eric,22,M,student


In [43]:
users.tail()

Unnamed: 0,name,age,gender,job
1,john,26,M,student
2,eric,22,M,student
3,paul,58,F,manager
4,peter,33,M,engineer
5,julie,44,F,scientist


In [44]:
users.tail(3)

Unnamed: 0,name,age,gender,job
3,paul,58,F,manager
4,peter,33,M,engineer
5,julie,44,F,scientist


In [55]:
users.sample()

Unnamed: 0,name,age,gender,job
3,paul,58,F,manager


In [58]:
users.sample(3)

Unnamed: 0,name,age,gender,job
2,eric,22,M,student
0,alice,19,F,student
1,john,26,M,student


In [61]:
print (type(users.any()))
users.any()

<class 'pandas.core.series.Series'>


name      True
age       True
gender    True
job       True
dtype: bool

In [62]:
print (users.index)
print (users.columns)

RangeIndex(start=0, stop=6, step=1)
Index(['name', 'age', 'gender', 'job'], dtype='object')


In [64]:
print (users.dtypes)
print (type(users.dtypes))

name      object
age        int64
gender    object
job       object
dtype: object
<class 'pandas.core.series.Series'>


In [66]:
users

Unnamed: 0,name,age,gender,job
0,alice,19,F,student
1,john,26,M,student
2,eric,22,M,student
3,paul,58,F,manager
4,peter,33,M,engineer
5,julie,44,F,scientist


In [65]:
print (users.shape)
print (f"So the Row# = {users.shape[0]} and Col# = {users.shape[1]}")

(6, 4)
So the Row# = 6 and Col# = 4


In [67]:
print (users.values)
print (type(users.values), users.values.ndim, users.values.shape, users.values.size)

[['alice' 19 'F' 'student']
 ['john' 26 'M' 'student']
 ['eric' 22 'M' 'student']
 ['paul' 58 'F' 'manager']
 ['peter' 33 'M' 'engineer']
 ['julie' 44 'F' 'scientist']]
<class 'numpy.ndarray'> 2 (6, 4) 24


In [71]:
users

Unnamed: 0,name,age,gender,job
0,alice,19,F,student
1,john,26,M,student
2,eric,22,M,student
3,paul,58,F,manager
4,peter,33,M,engineer
5,julie,44,F,scientist


In [73]:
print (users.job.value_counts())
print ()
print (users.gender.value_counts())

student      3
manager      1
engineer     1
scientist    1
Name: job, dtype: int64

F    3
M    3
Name: gender, dtype: int64
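
`value_counts()` returns absolute frequencies. As a small sketch, passing `normalize=True` returns proportions instead; the standalone `jobs` Series below mirrors the `job` column and is built here only so the example runs on its own.

```python
import pandas as pd

jobs = pd.Series(['student', 'student', 'student', 'manager', 'engineer', 'scientist'], name='job')

counts = jobs.value_counts()                 # absolute frequencies, most common first
shares = jobs.value_counts(normalize=True)   # relative frequencies that sum to 1

print(counts)
print(shares)
```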


In [74]:
users.describe()

Unnamed: 0,age
count,6.0
mean,33.666667
std,14.895189
min,19.0
25%,23.0
50%,29.5
75%,41.25
max,58.0


In [75]:
users.describe(include=['object'])

Unnamed: 0,name,gender,job
count,6,6,6
unique,6,2,4
top,alice,F,student
freq,1,3,3


In [76]:
users.describe(include = 'all')

Unnamed: 0,name,age,gender,job
count,6,6.0,6,6
unique,6,,2,4
top,alice,,F,student
freq,1,,3,3
mean,,33.666667,,
std,,14.895189,,
min,,19.0,,
25%,,23.0,,
50%,,29.5,,
75%,,41.25,,
max,,58.0,,


In [77]:
users.describe(include = 'all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
name,6.0,6.0,alice,1.0,,,,,,,
age,6.0,,,,33.666667,14.895189,19.0,23.0,29.5,41.25,58.0
gender,6.0,2.0,F,3.0,,,,,,,
job,6.0,4.0,student,3.0,,,,,,,


In [78]:
users.describe(include = 'all').transpose()

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
name,6.0,6.0,alice,1.0,,,,,,,
age,6.0,,,,33.666667,14.895189,19.0,23.0,29.5,41.25,58.0
gender,6.0,2.0,F,3.0,,,,,,,
job,6.0,4.0,student,3.0,,,,,,,


In [79]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name    6 non-null      object
 1   age     6 non-null      int64 
 2   gender  6 non-null      object
 3   job     6 non-null      object
dtypes: int64(1), object(3)
memory usage: 320.0+ bytes


In [80]:
merge_outer.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7 entries, 0 to 6
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   name    7 non-null      object 
 1   age     6 non-null      float64
 2   gender  6 non-null      object 
 3   job     6 non-null      object 
 4   height  5 non-null      float64
dtypes: float64(2), object(3)
memory usage: 336.0+ bytes
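
The non-null counts reported by `info()` can also be obtained programmatically. A minimal sketch, assuming a small frame with gaps similar to `merge_outer` (the data below is made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['alice', 'paul', 'anderson'],
    'age': [19.0, 58.0, np.nan],
    'height': [165.0, np.nan, 169.0],
})

missing_per_column = df.isna().sum()   # count of NaN values in each column
complete_rows = df.dropna()            # keep only rows with no missing value

print(missing_per_column)
print(complete_rows)
```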


## Column Selection

In [82]:
print (users.job)
print (type(users.job))

0      student
1      student
2      student
3      manager
4     engineer
5    scientist
Name: job, dtype: object
<class 'pandas.core.series.Series'>


In [84]:
print (users['job'])
print (type(users['job']))

0      student
1      student
2      student
3      manager
4     engineer
5    scientist
Name: job, dtype: object
<class 'pandas.core.series.Series'>


In [88]:
users[['job', 'age']]

Unnamed: 0,job,age
0,student,19
1,student,26
2,student,22
3,manager,58
4,engineer,33
5,scientist,44


In [89]:
my_cols = ['age', 'job']
users[my_cols]

Unnamed: 0,age,job
0,19,student
1,26,student
2,22,student
3,58,manager
4,33,engineer
5,44,scientist


## Row Selection

In [92]:
df = users.copy()
print (df)
print ()
print (df.iloc[3])
print (type(df.iloc[3]))

    name  age gender        job
0  alice   19      F    student
1   john   26      M    student
2   eric   22      M    student
3   paul   58      F    manager
4  peter   33      M   engineer
5  julie   44      F  scientist

name         paul
age            58
gender          F
job       manager
Name: 3, dtype: object
<class 'pandas.core.series.Series'>


In [93]:
df = users.copy()
print (df)
print ()
print (df.loc[3])
print (type(df.loc[3]))

    name  age gender        job
0  alice   19      F    student
1   john   26      M    student
2   eric   22      M    student
3   paul   58      F    manager
4  peter   33      M   engineer
5  julie   44      F  scientist

name         paul
age            58
gender          F
job       manager
Name: 3, dtype: object
<class 'pandas.core.series.Series'>


In [102]:
print (df)
print ()
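# chained access: select the row first, then the element within that row
print (df.iloc[4].iloc[3], df.iloc[4]['job'], df.loc[4].iloc[3], df.loc[4]['job'])    # engineer
# df.iloc[4, 3] and df.loc[4, 'job'] select row and column in a single step
print (df.iloc[4, 3], df.iloc[4]['job'], df.loc[4].iloc[3], df.loc[4, 'job'])    # engineer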

    name  age gender        job
0  alice   19      F    student
1   john   26      M    student
2   eric   22      M    student
3   paul   58      F    manager
4  peter   33      M   engineer
5  julie   44      F  scientist

engineer engineer engineer engineer
engineer engineer engineer engineer
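
For reading a single cell, pandas also offers the scalar accessors `.at` (by label) and `.iat` (by position). A brief sketch on a two-row frame built for illustration:

```python
import pandas as pd

df = pd.DataFrame({'name': ['peter', 'julie'], 'job': ['engineer', 'scientist']}, index=[4, 5])

by_label = df.at[4, 'job']    # row label 4, column label 'job'
by_position = df.iat[0, 1]    # first row, second column, both by position

print(by_label, by_position)
```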


In [104]:
print (df.shape, len(df), df.ndim, df.size)
print (merge_outer.shape, len(merge_outer), merge_outer.ndim, merge_outer.size)

(6, 4) 6 2 24
(7, 5) 7 2 35


In [105]:
users

Unnamed: 0,name,age,gender,job
0,alice,19,F,student
1,john,26,M,student
2,eric,22,M,student
3,paul,58,F,manager
4,peter,33,M,engineer
5,julie,44,F,scientist


In [111]:
# increasing all age values by 100
df = users.copy()
for i in range(df.shape[0]):
    current_row = df.iloc[i]
    current_row.age += 100
    df.iloc[i] = current_row
df

Unnamed: 0,name,age,gender,job
0,alice,119,F,student
1,john,126,M,student
2,eric,122,M,student
3,paul,158,F,manager
4,peter,133,M,engineer
5,julie,144,F,scientist


In [112]:
# import warnings
# warnings.filterwarnings('ignore')
for i in range(df.shape[0]):
    current_row = df.iloc[i]
    current_row.age += 100
    df.iloc[i] = current_row
    
df

Unnamed: 0,name,age,gender,job
0,alice,219,F,student
1,john,226,M,student
2,eric,222,M,student
3,paul,258,F,manager
4,peter,233,M,engineer
5,julie,244,F,scientist
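
The row-by-row loops above work, but the same update can be written as a single column operation. The sketch below is one possible vectorized alternative on a shortened copy of the data, followed by a conditional update with `.loc`:

```python
import pandas as pd

users = pd.DataFrame({'name': ['alice', 'john', 'eric'], 'age': [19, 26, 22]})

df = users.copy()
df['age'] = df['age'] + 100    # one vectorized operation instead of a Python loop
print(df)

# .loc with a boolean mask restricts the update to selected rows
df.loc[df['name'] == 'john', 'age'] += 1
print(df)
```

Working on a copy, as the cells above do, leaves the original `users` frame unchanged.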


## Row Selection and Filtering

In [113]:
print (users.age < 40)
users[users.age < 40]    # select * from users where age < 40;

0     True
1     True
2     True
3    False
4     True
5    False
Name: age, dtype: bool


Unnamed: 0,name,age,gender,job
0,alice,19,F,student
1,john,26,M,student
2,eric,22,M,student
4,peter,33,M,engineer


In [115]:
users[users.age < 40].job   # select job from users where age < 40

0     student
1     student
2     student
4    engineer
Name: job, dtype: object

In [116]:
users[users.age < 40]['job']

0     student
1     student
2     student
4    engineer
Name: job, dtype: object

In [117]:
users[users.age < 40][['job', 'name']]   # select job, name from users where age < 40

Unnamed: 0,job,name
0,student,alice
1,student,john
2,student,eric
4,engineer,peter


In [119]:
users

Unnamed: 0,name,age,gender,job
0,alice,19,F,student
1,john,26,M,student
2,eric,22,M,student
3,paul,58,F,manager
4,peter,33,M,engineer
5,julie,44,F,scientist


In [121]:
print ((users.age > 30) & (users.gender == 'F'))
users[(users.age > 30) & (users.gender == 'F')]

0    False
1    False
2    False
3     True
4    False
5     True
dtype: bool


Unnamed: 0,name,age,gender,job
3,paul,58,F,manager
5,julie,44,F,scientist


In [123]:
users[(users.age > 30) & (users.gender == 'F')][['name', 'job', 'gender']]
# select name, job, gender from users where age > 30 and gender = 'F'; 

Unnamed: 0,name,job,gender
3,paul,manager,F
5,julie,scientist,F


In [124]:
users[(users.age > 30) | (users.gender == 'F')][['name', 'job', 'gender', 'age']]

Unnamed: 0,name,job,gender,age
0,alice,student,F,19
3,paul,manager,F,58
4,peter,engineer,M,33
5,julie,scientist,F,44


In [125]:
users[(users.job == "engineer") | (users.job == 'student')][['name', 'job', 'gender', 'age']]

Unnamed: 0,name,job,gender,age
0,alice,student,F,19
1,john,student,M,26
2,eric,student,M,22
4,peter,engineer,M,33


In [126]:
users[users.job.isin(["engineer", 'student'])][['name', 'job', 'gender', 'age']]

Unnamed: 0,name,job,gender,age
0,alice,student,F,19
1,john,student,M,26
2,eric,student,M,22
4,peter,engineer,M,33


## Sorting

In [129]:
df = users.copy()
print (df.age)
print (df.age.sort_values())
print (df.age.sort_values(ignore_index = True))

0    19
1    26
2    22
3    58
4    33
5    44
Name: age, dtype: int64
0    19
2    22
1    26
4    33
5    44
3    58
Name: age, dtype: int64
0    19
1    22
2    26
3    33
4    44
5    58
Name: age, dtype: int64


In [130]:
df.sort_values(by = 'age')

Unnamed: 0,name,age,gender,job
0,alice,19,F,student
2,eric,22,M,student
1,john,26,M,student
4,peter,33,M,engineer
5,julie,44,F,scientist
3,paul,58,F,manager


In [131]:
df = df.sort_values(by = 'age', ascending = False)
df

Unnamed: 0,name,age,gender,job
3,paul,58,F,manager
5,julie,44,F,scientist
4,peter,33,M,engineer
1,john,26,M,student
2,eric,22,M,student
0,alice,19,F,student


In [137]:
df = users.copy()
df.sort_values(by = 'age', ascending = False, inplace = True, ignore_index = True)
df

Unnamed: 0,name,age,gender,job
0,paul,58,F,manager
1,julie,44,F,scientist
2,peter,33,M,engineer
3,john,26,M,student
4,eric,22,M,student
5,alice,19,F,student


In [139]:
df = users.copy()
df.sort_values(by = ['job', 'age'], inplace = True, ignore_index = True)
df

Unnamed: 0,name,age,gender,job
0,peter,33,M,engineer
1,paul,58,F,manager
2,julie,44,F,scientist
3,alice,19,F,student
4,eric,22,M,student
5,john,26,M,student


In [141]:
df = users.copy()
df.sort_values(by = ['job', 'age'], inplace = True)
print (df.index)
df.index = range(0, df.shape[0])
print (df.index)
df

Int64Index([4, 3, 5, 0, 2, 1], dtype='int64')
RangeIndex(start=0, stop=6, step=1)


Unnamed: 0,name,age,gender,job
0,peter,33,M,engineer
1,paul,58,F,manager
2,julie,44,F,scientist
3,alice,19,F,student
4,eric,22,M,student
5,john,26,M,student
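
When sorting by several columns, `ascending` also accepts a list with one flag per column. The sketch below sorts by job in ascending order and by age in descending order within each job, then renumbers the rows with `reset_index(drop=True)`; the shortened data is an assumption for illustration.

```python
import pandas as pd

users = pd.DataFrame({
    'name': ['alice', 'john', 'eric', 'paul'],
    'age': [19, 26, 22, 58],
    'job': ['student', 'student', 'student', 'manager'],
})

# job ascending, and within each job the oldest person first
ordered = users.sort_values(by=['job', 'age'], ascending=[True, False])
ordered = ordered.reset_index(drop=True)   # renumber rows 0..n-1 after sorting
print(ordered)
```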
