# Exploratory Data Analysis with Pandas
#####################################################################################

This Jupyter Notebook is designed for the CI7340 Applied Data Programming
module of the MSc Data Science degree programme at Kingston University.

Parts of this module are borrowed from the official [Pandas](https://pandas.pydata.org/) website.

Copyright@ *Nabajeet Barman*, Kingston University, London, UK

#####################################################################################


## Topics Covered:

> * Introduction to Pandas
> * Key Features of Pandas
> * Series and Dataframes
> * Viewing Data
> * Calculating Summary Statistics
> * Selection

A cheatsheet summarizing the important aspects can be found [here](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)

## Introduction to Pandas

The name pandas is derived from "panel data", an econometrics term for multidimensional data sets.

pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language [[1]](https://pandas.pydata.org/).

* Built on top of NumPy and is a part of the SciPy ecosystem (Scientific Computing Tools for Python)

* Used in statsmodels, sklearn-pandas, Plotly, IPython, Jupyter, Spyder




## Key Features of Pandas
> * Fast and efficient DataFrame object with default and customized indexing.
> * Tools for loading data into in-memory data objects from different file formats.
> * Data alignment and integrated handling of missing data.
> * Reshaping and pivoting of data sets.
> * Label-based slicing, indexing and subsetting of large data sets.
> * Columns from a data structure can be deleted or inserted.
> * Group by data for aggregation and transformations.
> * High performance merging and joining of data.
> * Time Series functionality.

### Import the required libraries

If not installed previously, use either of the following commands: 
> `pip install pandas` or alternatively, `conda install pandas`

In [1]:
import numpy as np
import pandas as pd
# Check the installed pandas version
pd.__version__

'1.1.5'

# Core components of pandas: Series and DataFrames

A **DataFrame** is a 2D data structure that can store data of different types (including characters, integers, floating point values, categorical data and more) in columns. It is similar to a spreadsheet, a SQL table or the data.frame in R.

Each column in a DataFrame is a **Series** (a one-dimensional array of values with an index), as shown in the example below.

![Series vs. Dataframe ](https://drive.google.com/uc?export=view&id=1Gf-cC8JxRAV8NpVbWioNDp7I4liw2YJ5)
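
As a minimal sketch of the relationship described above (the frame `people` and its values are made up for illustration), selecting a single column returns a Series that shares the DataFrame's index:

```python
import pandas as pd

# Hypothetical data, used only for illustration
people = pd.DataFrame({"Name": ["Ann", "Ben"], "Age": [31, 45]})

age = people["Age"]                     # selecting one column gives a Series
print(type(people).__name__)            # DataFrame
print(type(age).__name__)               # Series
print(age.index.equals(people.index))   # True: same row labels
```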



# Creating a DataFrame using a Python dictionary

In [2]:
data = {"Name":["Shashwat","Ajit","Ranjit"], "Age":[32,40,33]}
df=pd.DataFrame(data)
df

Unnamed: 0,Name,Age
0,Shashwat,32
1,Ajit,40
2,Ranjit,33


In [3]:
df.add({"Name":"NewName", "Age":22}) # Returns a new DataFrame with the values added element-wise; df itself is not modified

Unnamed: 0,Name,Age
0,ShashwatNewName,54
1,AjitNewName,62
2,RanjitNewName,55


In [4]:
df[df['Age']>35]

Unnamed: 0,Name,Age
1,Ajit,40


In [5]:
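# Passing an existing DataFrame with a new index reindexes it: labels 44, 55 and 66 do not exist in df, so every value becomes NaN
df = pd.DataFrame(df,index=[44,55,66])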
df

Unnamed: 0,Name,Age
44,,
55,,
66,,
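
The all-NaN result above is a reindexing effect. The sketch below contrasts it with simply relabelling the rows; the names `people`, `reindexed` and `relabelled` are illustrative and not part of the original notebook.

```python
import pandas as pd

people = pd.DataFrame({"Name": ["Shashwat", "Ajit", "Ranjit"], "Age": [32, 40, 33]})

# Reindexing: labels 44, 55, 66 are looked up in the old index (0, 1, 2),
# none are found, so every cell becomes NaN
reindexed = pd.DataFrame(people, index=[44, 55, 66])

# Relabelling: keep the data and only replace the row labels
relabelled = people.copy()
relabelled.index = [44, 55, 66]

print(reindexed)
print(relabelled)
```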


In [6]:
data = {
'employee_name' : ['Sam', 'Max', 'Tony', 'Sarah', 'Tania'],
'employee_dept' : ['Research', 'HR', 'Marketing', 'Sales', 'Finance']
}

In [7]:
employee_records = pd.DataFrame(data)
employee_records


Unnamed: 0,employee_name,employee_dept
0,Sam,Research
1,Max,HR
2,Tony,Marketing
3,Sarah,Sales
4,Tania,Finance


In [8]:
#Assigning Index in the dataframe
employee_records = pd.DataFrame(data, index = [1110,1111,1112,1113,1114])
employee_records


Unnamed: 0,employee_name,employee_dept
1110,Sam,Research
1111,Max,HR
1112,Tony,Marketing
1113,Sarah,Sales
1114,Tania,Finance


In [9]:
employee_records.index.name='employee_id'
employee_records

employee_id,employee_name,employee_dept
1110,Sam,Research
1111,Max,HR
1112,Tony,Marketing
1113,Sarah,Sales
1114,Tania,Finance


In [10]:
# Accessing records by row index label
employee_records.loc[1110]

employee_name         Sam
employee_dept    Research
Name: 1110, dtype: object

In [11]:
series = pd.Series([4234,3243,34,34,4563,4489], name='employee_id', index=[1,2,3,4,5,6])
series

1    4234
2    3243
3      34
4      34
5    4563
6    4489
Name: employee_id, dtype: int64

In [12]:
# Accessing records by integer position: rows 0 and 2 (0:4 with step 2), first two columns
employee_records.iloc[0:4:2,0:2]

employee_id,employee_name,employee_dept
1110,Sam,Research
1112,Tony,Marketing


# Creating a DF by passing a NumPy array

In [13]:
# define three column names 
columns = ['col_1','col_2','col_3']
# Define index
indexs = ['a','b','c']

#define a NumPy array of size 3*3
data = np.array([[1,2,3],[3,4,5],[6,7,8]])
# creating pandas data frame
sample_df = pd.DataFrame(data, index=indexs, columns=columns)
sample_df

Unnamed: 0,col_1,col_2,col_3
a,1,2,3
b,3,4,5
c,6,7,8


# Create a series from scratch

In [14]:
emp_id = pd.Series([4434,3345,4432,1123,5678,7766], name='employee_id')
emp_id

0    4434
1    3345
2    4432
3    1123
4    5678
5    7766
Name: employee_id, dtype: int64

# Creating a DataFrame using multiple Series with the same index

In [15]:
s1=pd.Series([10,20,30,40,50],index=[1,2,3,4,5])
s2=pd.Series([21,34,43,55,73],index=[1,2,3,4,5])
sample_df = pd.DataFrame({'col_1':s1,'col_2':s2})
sample_df

Unnamed: 0,col_1,col_2
1,10,21
2,20,34
3,30,43
4,40,55
5,50,73


In [16]:
# s2 has a different index order and different labels. pandas aligns both Series on their index labels and inserts NaN where a label is missing from one of them
s1=pd.Series([10,20,30,40,50],index=[1,2,3,4,5])
s2=pd.Series([21,34,43,55,73],index=[2,1,6,4,3])
sample_df = pd.DataFrame({'col_1':s1,'col_2':s2})
sample_df

Unnamed: 0,col_1,col_2
1,10.0,34.0
2,20.0,21.0
3,30.0,73.0
4,40.0,55.0
5,50.0,
6,,43.0


In [17]:
sample_df.dtypes

col_1    float64
col_2    float64
dtype: object
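
Both columns are reported as `float64` although the Series held integers, because NaN is a floating-point value. The sketch below reproduces this on a smaller example and, assuming the nullable `"Int64"` extension dtype is available (pandas 1.0 or later), shows one way to keep integers alongside missing values.

```python
import pandas as pd

s1 = pd.Series([10, 20, 30], index=[1, 2, 3])
s2 = pd.Series([21, 34, 43], index=[2, 3, 4])  # label 1 missing, label 4 extra

aligned = pd.DataFrame({"col_1": s1, "col_2": s2})
print(aligned.dtypes)   # both float64, because NaN is a float value

# Assumption: the nullable "Int64" extension dtype is available (pandas 1.0+)
as_int = aligned.astype("Int64")
print(as_int.dtypes)    # both Int64, missing values shown as <NA>
```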

In [18]:
df1 = pd.DataFrame({
    "A": np.random.rand(3),
    "B": 1,
    "C":"foo",
    "D": pd.Timestamp("20010102"),
    "E": pd.Series([1.0] * 3).astype("float32"),
    "F": False,
    "G": pd.Series([1] * 3, dtype="int8"),
})
df1.dtypes

A           float64
B             int64
C            object
D    datetime64[ns]
E           float32
F              bool
G              int8
dtype: object

In [19]:
# Column A now has 4 values, so the DataFrame has 4 rows. Scalars (B, C, D, F) are repeated on every row;
# the 3-element Series E is aligned on the index, so it gets NaN in row 3 instead of raising an error.
df1 = pd.DataFrame({
    "A": np.random.rand(4), #this fixes the length of columns
    "B": 1,
    "C":"foo",
    "D": pd.Timestamp("20010102"),
    "E": pd.Series([1.0] * 3).astype("float32"),
    "F": False,
    "G": pd.Series([1] * 4, dtype="int8"),
})
df1

Unnamed: 0,A,B,C,D,E,F,G
0,0.720653,1,foo,2001-01-02,1.0,False,1
1,0.839359,1,foo,2001-01-02,1.0,False,1
2,0.270158,1,foo,2001-01-02,1.0,False,1
3,0.267513,1,foo,2001-01-02,,False,1


In [20]:
# Column C is reported as object because this pandas version (1.1.5) stores Python strings in object-dtype columns by default
df1.dtypes

A           float64
B             int64
C            object
D    datetime64[ns]
E           float32
F              bool
G              int8
dtype: object
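
If a dedicated string type is wanted, one option is to convert the column explicitly. This is a sketch that assumes the `"string"` extension dtype (introduced in pandas 1.0) is available; the frame `frame` is illustrative.

```python
import pandas as pd

frame = pd.DataFrame({"C": ["foo", "foo", "foo"]})
print(frame["C"].dtype)     # object in pandas 1.x; may differ in newer releases

# Assumption: the dedicated "string" extension dtype is available (pandas 1.0+)
as_string = frame["C"].astype("string")
print(as_string.dtype)
```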

# Creating the Employee DataFrame

In [21]:
employee_records = pd.DataFrame({
        'employee_name': ['Sam', 'Max', 'Tony', 'Sarah', 'Tania', 'David', 
                         'Mark','Alice', 'Charles', 'Bob', 'Anna'],
        'employee_dept': ['Research','HR','Marketing','Sales', 'Finance', 'IT', 'HR', 'Marketing', 'IT', 'Finance','Sales'],
        'employee_id' : [10001, 10002, 10003, 10004, 10005, 10006, 10007, 10008, 10009, 10010, 10011],
        'salary'     : [45034.88, 65343.45, 53423.27, 76422.34, 58753.00, 34323.44, 66544.60, 34354.66, 55234.96, 39078.60, 44567.88]
    })
employee_records



Unnamed: 0,employee_name,employee_dept,employee_id,salary
0,Sam,Research,10001,45034.88
1,Max,HR,10002,65343.45
2,Tony,Marketing,10003,53423.27
3,Sarah,Sales,10004,76422.34
4,Tania,Finance,10005,58753.0
5,David,IT,10006,34323.44
6,Mark,HR,10007,66544.6
7,Alice,Marketing,10008,34354.66
8,Charles,IT,10009,55234.96
9,Bob,Finance,10010,39078.6


In [22]:
# print top 5 rows
employee_records.head()

Unnamed: 0,employee_name,employee_dept,employee_id,salary
0,Sam,Research,10001,45034.88
1,Max,HR,10002,65343.45
2,Tony,Marketing,10003,53423.27
3,Sarah,Sales,10004,76422.34
4,Tania,Finance,10005,58753.0


In [23]:
# print bottom 6 rows
employee_records.tail(6)

Unnamed: 0,employee_name,employee_dept,employee_id,salary
5,David,IT,10006,34323.44
6,Mark,HR,10007,66544.6
7,Alice,Marketing,10008,34354.66
8,Charles,IT,10009,55234.96
9,Bob,Finance,10010,39078.6
10,Anna,Sales,10011,44567.88


In [24]:
# Asking for more rows than exist returns the whole DataFrame
employee_records.tail(10000000)

Unnamed: 0,employee_name,employee_dept,employee_id,salary
0,Sam,Research,10001,45034.88
1,Max,HR,10002,65343.45
2,Tony,Marketing,10003,53423.27
3,Sarah,Sales,10004,76422.34
4,Tania,Finance,10005,58753.0
5,David,IT,10006,34323.44
6,Mark,HR,10007,66544.6
7,Alice,Marketing,10008,34354.66
8,Charles,IT,10009,55234.96
9,Bob,Finance,10010,39078.6


In [25]:
# How does a DataFrame store its index?
employee_records.index # A RangeIndex: the default index is stored as a range (start, stop, step), not as individual values
employee_records.index.values # The index labels as a NumPy array

#Displaying Column Names
employee_records.keys()
employee_records.columns

Index(['employee_name', 'employee_dept', 'employee_id', 'salary'], dtype='object')

In [26]:
# Convert the DataFrame to a NumPy array (mixed column types give dtype=object). What would be a use case for this?
employee_records.to_numpy()

array([['Sam', 'Research', 10001, 45034.88],
       ['Max', 'HR', 10002, 65343.45],
       ['Tony', 'Marketing', 10003, 53423.27],
       ['Sarah', 'Sales', 10004, 76422.34],
       ['Tania', 'Finance', 10005, 58753.0],
       ['David', 'IT', 10006, 34323.44],
       ['Mark', 'HR', 10007, 66544.6],
       ['Alice', 'Marketing', 10008, 34354.66],
       ['Charles', 'IT', 10009, 55234.96],
       ['Bob', 'Finance', 10010, 39078.6],
       ['Anna', 'Sales', 10011, 44567.88]], dtype=object)

In [27]:
employee_records

Unnamed: 0,employee_name,employee_dept,employee_id,salary
0,Sam,Research,10001,45034.88
1,Max,HR,10002,65343.45
2,Tony,Marketing,10003,53423.27
3,Sarah,Sales,10004,76422.34
4,Tania,Finance,10005,58753.0
5,David,IT,10006,34323.44
6,Mark,HR,10007,66544.6
7,Alice,Marketing,10008,34354.66
8,Charles,IT,10009,55234.96
9,Bob,Finance,10010,39078.6


In [28]:
df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
                   'B': {0: 1, 1: 3, 2: 5},
                   'C': {0: 2, 1: 4, 2: 6}})
df

Unnamed: 0,A,B,C
0,a,1,2
1,b,3,4
2,c,5,6


In [29]:
df.melt(id_vars=['A'], value_vars=['B'])

Unnamed: 0,A,variable,value
0,a,B,1
1,b,B,3
2,c,B,5


In [30]:
df.melt(id_vars=['A'], value_vars=['B', 'C'])


Unnamed: 0,A,variable,value
0,a,B,1
1,b,B,3
2,c,B,5
3,a,C,2
4,b,C,4
5,c,C,6
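
`melt` turns a wide table into a long one, with one row per identifier and variable pair. The sketch below shows the reshaping and a round trip back to wide form with `pivot`; the names `wide`, `long` and `back` are illustrative.

```python
import pandas as pd

wide = pd.DataFrame({"A": ["a", "b"], "B": [1, 3], "C": [2, 4]})

# Wide to long: one row per (A, variable) pair
long = wide.melt(id_vars=["A"], value_vars=["B", "C"])
print(long.shape)            # (4, 3)

# Long back to wide with pivot
back = long.pivot(index="A", columns="variable", values="value")
print(back.loc["b", "C"])    # 4
```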


In [31]:
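# This cell raises NameError: `index` is used on the next line but only defined at the end of the cell
df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=["A", "B", "C"])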
s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
index = pd.date_range("1/1/2000", periods=8)

NameError: ignored
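
The `NameError` recorded above comes from using `index` before it is defined. A corrected ordering might look like the sketch below; the names `ts_index`, `ts_df` and `ts_s` are chosen here so that the existing `df` is not overwritten, and they are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Define the index first, then use it
ts_index = pd.date_range("1/1/2000", periods=8)
ts_df = pd.DataFrame(np.random.randn(8, 3), index=ts_index, columns=["A", "B", "C"])
ts_s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])

print(ts_df.shape)   # (8, 3)
print(ts_s.shape)    # (5,)
```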

In [32]:
index1 = pd.date_range('1/1/2021', periods=8, freq='D')
index2 = pd.date_range('1/1/2021', periods=8, freq='M')
index3 = pd.date_range('1/1/2021', periods=8, freq='Y')

index3

DatetimeIndex(['2021-12-31', '2022-12-31', '2023-12-31', '2024-12-31',
               '2025-12-31', '2026-12-31', '2027-12-31', '2028-12-31'],
              dtype='datetime64[ns]', freq='A-DEC')

In [33]:
long_series = pd.Series(np.random.randn(1000))
long_series

0      1.694720
1      0.517790
2      0.223495
3      0.329717
4     -0.049942
         ...   
995   -0.200354
996   -2.391184
997    0.254286
998    0.870401
999    0.090848
Length: 1000, dtype: float64

In [34]:
type(index3)

pandas.core.indexes.datetimes.DatetimeIndex

In [35]:
df

Unnamed: 0,A,B,C
0,a,1,2
1,b,3,4
2,c,5,6


# Accessing Items of DF

* `shape`: gives the axis dimensions of the object, consistent with ndarray
* Axis labels:
  * Series: index (only axis)
  * DataFrame: index (rows) and columns
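
A small sketch of these attributes, using a made-up frame named `scores`:

```python
import pandas as pd

scores = pd.DataFrame({"math": [90, 75, 60], "art": [70, 85, 95]},
                      index=["x", "y", "z"])

print(scores.shape)            # (3, 2): 3 rows, 2 columns
print(list(scores.index))      # ['x', 'y', 'z']
print(list(scores.columns))    # ['math', 'art']
print(scores["math"].shape)    # (3,): a Series has a single axis
```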

In [50]:
# Both slices act on rows: df[:2] keeps the first two rows, and the second [:2] slices those rows again (it does not select columns)
print(df[:2][:2])


   a  b  c
0  a  1  2
1  b  3  4


In [52]:

#Accessing column
print(df['a'])


0    a
1    b
2    c
Name: a, dtype: object


In [53]:
#Accessing particular value
print(df['a'][1])
print(df[:2][:2])

b
   a  b  c
0  a  1  2
1  b  3  4


In [45]:
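# Lower-case the column names (executed as In [45], before the In [50] to In [53] cells shown above)
df.columns = [x.lower() for x in df.columns]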
df

Unnamed: 0,a,b,c
0,a,1,2
1,b,3,4
2,c,5,6


In [54]:
# s was never created because In [31] failed, so this cell and the next two raise NameError
s.array

NameError: ignored

In [55]:
s.index.array

NameError: ignored

In [56]:
s.to_numpy()

NameError: ignored

In [57]:
employee_records

Unnamed: 0,employee_name,employee_dept,employee_id,salary
0,Sam,Research,10001,45034.88
1,Max,HR,10002,65343.45
2,Tony,Marketing,10003,53423.27
3,Sarah,Sales,10004,76422.34
4,Tania,Finance,10005,58753.0
5,David,IT,10006,34323.44
6,Mark,HR,10007,66544.6
7,Alice,Marketing,10008,34354.66
8,Charles,IT,10009,55234.96
9,Bob,Finance,10010,39078.6


In [40]:
employee_records.describe()

Unnamed: 0,employee_id,salary
count,11.0,11.0
mean,10006.0,52098.28
std,3.316625,13923.224633
min,10001.0,34323.44
25%,10003.5,41823.24
50%,10006.0,53423.27
75%,10008.5,62048.225
max,10011.0,76422.34


In [39]:
employee_records.describe(include='all')

Unnamed: 0,employee_name,employee_dept,employee_id,salary
count,11,11,11.0,11.0
unique,11,6,,
top,Charles,Marketing,,
freq,1,2,,
mean,,,10006.0,52098.28
std,,,3.316625,13923.224633
min,,,10001.0,34323.44
25%,,,10003.5,41823.24
50%,,,10006.0,53423.27
75%,,,10008.5,62048.225
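
The statistics that `describe()` reports can also be computed one at a time. The sketch below uses a three-value subset of the salary column, stored in an illustrative Series named `salary`.

```python
import pandas as pd

salary = pd.Series([45034.88, 65343.45, 53423.27], name="salary")

print(salary.mean())
print(salary.median())                # 53423.27
print(salary.min(), salary.max())
print(salary.agg(["mean", "std"]))    # several statistics in one call
```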


In [38]:
# Transpose
employee_records.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
employee_name,Sam,Max,Tony,Sarah,Tania,David,Mark,Alice,Charles,Bob,Anna
employee_dept,Research,HR,Marketing,Sales,Finance,IT,HR,Marketing,IT,Finance,Sales
employee_id,10001,10002,10003,10004,10005,10006,10007,10008,10009,10010,10011
salary,45034.9,65343.4,53423.3,76422.3,58753,34323.4,66544.6,34354.7,55235,39078.6,44567.9


# Accessing a DataFrame

In [37]:
# By column name
employee_records['salary']

0     45034.88
1     65343.45
2     53423.27
3     76422.34
4     58753.00
5     34323.44
6     66544.60
7     34354.66
8     55234.96
9     39078.60
10    44567.88
Name: salary, dtype: float64

In [58]:
# By row position (slice)
employee_records[0:7]

Unnamed: 0,employee_name,employee_dept,employee_id,salary
0,Sam,Research,10001,45034.88
1,Max,HR,10002,65343.45
2,Tony,Marketing,10003,53423.27
3,Sarah,Sales,10004,76422.34
4,Tania,Finance,10005,58753.0
5,David,IT,10006,34323.44
6,Mark,HR,10007,66544.6


In [59]:
# Getting a particular element
employee_records['employee_name'][0]

'Sam'

In [60]:
isinstance(employee_records,pd.DataFrame)

True

In [61]:
# Check if element is of a particular type
print(isinstance(employee_records['employee_name'][1], str))
print(type(employee_records['employee_name'][1]))

True
<class 'str'>


# Slicing in pandas


In [62]:
# Slicing rows by position; columns cannot be sliced this way (use loc or iloc instead)
employee_records[0:5]

Unnamed: 0,employee_name,employee_dept,employee_id,salary
0,Sam,Research,10001,45034.88
1,Max,HR,10002,65343.45
2,Tony,Marketing,10003,53423.27
3,Sarah,Sales,10004,76422.34
4,Tania,Finance,10005,58753.0


In [63]:
employee_records[0::2] # A step of 2 returns every second row

Unnamed: 0,employee_name,employee_dept,employee_id,salary
0,Sam,Research,10001,45034.88
2,Tony,Marketing,10003,53423.27
4,Tania,Finance,10005,58753.0
6,Mark,HR,10007,66544.6
8,Charles,IT,10009,55234.96
10,Anna,Sales,10011,44567.88


# Accessing using loc and iloc

1) iloc --> uses integer positions (0, 1, 2, ...), regardless of the index labels

2) loc --> uses index labels
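
The difference is easiest to see when the index is not the default 0, 1, 2, .... The sketch below uses a small illustrative frame named `staff`:

```python
import pandas as pd

staff = pd.DataFrame({"name": ["Sam", "Max", "Tony"]}, index=[1110, 1111, 1112])

print(staff.loc[1111, "name"])    # label 1111             -> Max
print(staff.iloc[1, 0])           # second row, first col  -> Max

# Slices differ too: loc includes the end label, iloc excludes the end position
print(len(staff.loc[1110:1111]))  # 2
print(len(staff.iloc[0:1]))       # 1
```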

In [64]:
employee_records.loc[2]

employee_name         Tony
employee_dept    Marketing
employee_id          10003
salary             53423.3
Name: 2, dtype: object

In [65]:
employee_records.loc[3,'salary']

76422.34

In [66]:
employee_records.iloc[2]

employee_name         Tony
employee_dept    Marketing
employee_id          10003
salary             53423.3
Name: 2, dtype: object

In [67]:
# Row based access, non-inclusive end index 
employee_records[3:6]

Unnamed: 0,employee_name,employee_dept,employee_id,salary
3,Sarah,Sales,10004,76422.34
4,Tania,Finance,10005,58753.0
5,David,IT,10006,34323.44


In [68]:
# Label-based row access: with loc the end label is inclusive, unlike employee_records[3:6]
employee_records.loc[3:6]

Unnamed: 0,employee_name,employee_dept,employee_id,salary
3,Sarah,Sales,10004,76422.34
4,Tania,Finance,10005,58753.0
5,David,IT,10006,34323.44
6,Mark,HR,10007,66544.6


In [69]:
# Position-based row access: with iloc the end position is exclusive, like employee_records[3:6]
employee_records.iloc[3:6]

Unnamed: 0,employee_name,employee_dept,employee_id,salary
3,Sarah,Sales,10004,76422.34
4,Tania,Finance,10005,58753.0
5,David,IT,10006,34323.44


In [70]:
# Slicing columns by position: all rows, columns 1 and 2
employee_records.iloc[:,1:3]

Unnamed: 0,employee_dept,employee_id
0,Research,10001
1,HR,10002
2,Marketing,10003
3,Sales,10004
4,Finance,10005
5,IT,10006
6,HR,10007
7,Marketing,10008
8,IT,10009
9,Finance,10010


In [71]:
# Accessing a list of specific index labels (not a range)
employee_records.loc[[3,7]] # Returns the rows labelled 3 and 7 as a DataFrame

Unnamed: 0,employee_name,employee_dept,employee_id,salary
3,Sarah,Sales,10004,76422.34
7,Alice,Marketing,10008,34354.66


In [72]:
# Accessing a list of specific row positions (not a range)
employee_records.iloc[[3,7]] # Returns the rows at positions 3 and 7 as a DataFrame

Unnamed: 0,employee_name,employee_dept,employee_id,salary
3,Sarah,Sales,10004,76422.34
7,Alice,Marketing,10008,34354.66


In [73]:
# Returns rows which have value as true
employee_records.loc[[False,False,False,True,False,True,False,False,True,False,False]]

Unnamed: 0,employee_name,employee_dept,employee_id,salary
3,Sarah,Sales,10004,76422.34
5,David,IT,10006,34323.44
8,Charles,IT,10009,55234.96


In [74]:
# Returns rows which have value as true
employee_records.iloc[[False,False,False,True,False,True,False,False,True,False,False]]

Unnamed: 0,employee_name,employee_dept,employee_id,salary
3,Sarah,Sales,10004,76422.34
5,David,IT,10006,34323.44
8,Charles,IT,10009,55234.96


# Conditional Selection

In [75]:
employee_records.loc[employee_records['salary']>40000]

Unnamed: 0,employee_name,employee_dept,employee_id,salary
0,Sam,Research,10001,45034.88
1,Max,HR,10002,65343.45
2,Tony,Marketing,10003,53423.27
3,Sarah,Sales,10004,76422.34
4,Tania,Finance,10005,58753.0
6,Mark,HR,10007,66544.6
8,Charles,IT,10009,55234.96
10,Anna,Sales,10011,44567.88


In [76]:
# Check column info, can be used to analyse data initially
employee_records.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   employee_name  11 non-null     object 
 1   employee_dept  11 non-null     object 
 2   employee_id    11 non-null     int64  
 3   salary         11 non-null     float64
dtypes: float64(1), int64(1), object(2)
memory usage: 480.0+ bytes


In [77]:
#Sorting by a particular column
#Default is ascending=True
employee_records1 = employee_records.sort_values(by='salary', ascending=False)

In [78]:
employee_records1.iloc[[3,7]]

Unnamed: 0,employee_name,employee_dept,employee_id,salary
4,Tania,Finance,10005,58753.0
10,Anna,Sales,10011,44567.88


In [79]:
employee_records1.loc[[3,7]]

Unnamed: 0,employee_name,employee_dept,employee_id,salary
3,Sarah,Sales,10004,76422.34
7,Alice,Marketing,10008,34354.66


In [80]:
employee_records['salary'] > 60000

0     False
1      True
2     False
3      True
4     False
5     False
6      True
7     False
8     False
9     False
10    False
Name: salary, dtype: bool

In [81]:
# Fetch particular columns from rows satisfying a particular condition
employee_records.loc[employee_records['salary'] > 60000, ['employee_id']]

Unnamed: 0,employee_id
1,10002
3,10004
6,10007


In [82]:
# Fetch particular columns from rows satisfying a particular condition
employee_records.loc[employee_records['salary'] > 60000, ['employee_id','employee_dept']]

Unnamed: 0,employee_id,employee_dept
1,10002,HR
3,10004,Sales
6,10007,HR


In [83]:
# Access a column directly by name (as an attribute) --> equivalent to employee_records['employee_dept']
employee_records.employee_dept


0      Research
1            HR
2     Marketing
3         Sales
4       Finance
5            IT
6            HR
7     Marketing
8            IT
9       Finance
10        Sales
Name: employee_dept, dtype: object

In [84]:
employee_records.loc[employee_records['employee_dept']=='HR']

Unnamed: 0,employee_name,employee_dept,employee_id,salary
1,Max,HR,10002,65343.45
6,Mark,HR,10007,66544.6


# Selection with multiple conditions

In [85]:
employee_records.loc[(employee_records.salary > 40000) & (employee_records.salary < 60000)]

Unnamed: 0,employee_name,employee_dept,employee_id,salary
0,Sam,Research,10001,45034.88
2,Tony,Marketing,10003,53423.27
4,Tania,Finance,10005,58753.0
8,Charles,IT,10009,55234.96
10,Anna,Sales,10011,44567.88


In [86]:
employee_records.loc[(employee_records.employee_dept == 'IT') & (employee_records.salary < 60000)]

Unnamed: 0,employee_name,employee_dept,employee_id,salary
5,David,IT,10006,34323.44
8,Charles,IT,10009,55234.96
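
Besides `&`, conditions can be combined with `|` (or) and inverted with `~` (not). The sketch below also shows `isin` and `between` as shorthand for common cases; the frame `staff` and its values are illustrative.

```python
import pandas as pd

staff = pd.DataFrame({
    "dept": ["HR", "IT", "Sales", "IT"],
    "salary": [65000.0, 34000.0, 76000.0, 55000.0],
})

# OR: each condition needs its own parentheses
hr_or_high = staff.loc[(staff["dept"] == "HR") | (staff["salary"] > 70000)]

# NOT: invert a boolean mask with ~
not_it = staff.loc[~(staff["dept"] == "IT")]

# isin and between are shorthand for common multi-condition filters
hr_sales = staff.loc[staff["dept"].isin(["HR", "Sales"])]
mid_band = staff.loc[staff["salary"].between(40000, 70000)]

print(hr_or_high, not_it, hr_sales, mid_band, sep="\n\n")
```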


# Copying a DataFrame

In [87]:
# copy() returns an independent copy, not a reference: changes to temp_df won't affect employee_records
temp_df = employee_records.copy()
#Set a value
temp_df.loc[2] = 5000 # set value for each column in that row

In [88]:
temp_df

Unnamed: 0,employee_name,employee_dept,employee_id,salary
0,Sam,Research,10001,45034.88
1,Max,HR,10002,65343.45
2,5000,5000,5000,5000.0
3,Sarah,Sales,10004,76422.34
4,Tania,Finance,10005,58753.0
5,David,IT,10006,34323.44
6,Mark,HR,10007,66544.6
7,Alice,Marketing,10008,34354.66
8,Charles,IT,10009,55234.96
9,Bob,Finance,10010,39078.6


In [90]:
employee_records

Unnamed: 0,employee_name,employee_dept,employee_id,salary
0,Sam,Research,10001,45034.88
1,Max,HR,10002,65343.45
2,Tony,Marketing,10003,53423.27
3,Sarah,Sales,10004,76422.34
4,Tania,Finance,10005,58753.0
5,David,IT,10006,34323.44
6,Mark,HR,10007,66544.6
7,Alice,Marketing,10008,34354.66
8,Charles,IT,10009,55234.96
9,Bob,Finance,10010,39078.6
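
For contrast, plain assignment does not copy anything. The sketch below (with illustrative names `original`, `alias` and `independent`) shows that an alias and the original are the same object, while a `copy()` is not:

```python
import pandas as pd

original = pd.DataFrame({"salary": [100.0, 200.0]})

alias = original                 # a second name for the same object
independent = original.copy()    # a separate object with its own data

independent.loc[0, "salary"] = -1.0
print(original.loc[0, "salary"])    # 100.0: the copy did not touch original

alias.loc[1, "salary"] = 999.0
print(original.loc[1, "salary"])    # 999.0: alias and original are one object
```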


In [89]:
temp_df2 = employee_records.copy()
temp_df2.loc[13] = ['Shashwat', 'IT', 10012, 50000]
temp_df2.loc[12] = 50000
# temp_df2.loc[14:15] = [['Padmesh', 'IT', 10013, 500000],['Padmesh1', 'IT', 10014, 500000]]
# TODO: Find how to add multiple entries to DF using loc or iloc.
temp_df2

Unnamed: 0,employee_name,employee_dept,employee_id,salary
0,Sam,Research,10001,45034.88
1,Max,HR,10002,65343.45
2,Tony,Marketing,10003,53423.27
3,Sarah,Sales,10004,76422.34
4,Tania,Finance,10005,58753.0
5,David,IT,10006,34323.44
6,Mark,HR,10007,66544.6
7,Alice,Marketing,10008,34354.66
8,Charles,IT,10009,55234.96
9,Bob,Finance,10010,39078.6
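
One possible answer to the TODO in the cell above is to build the new rows as their own DataFrame and combine the two with `pd.concat`. This is a sketch on a reduced two-column frame; `records`, `new_rows` and `combined` are illustrative names.

```python
import pandas as pd

records = pd.DataFrame({"employee_name": ["Sam", "Max"],
                        "salary": [45034.88, 65343.45]})

# Hypothetical new rows, labelled 14 and 15
new_rows = pd.DataFrame(
    {"employee_name": ["Padmesh", "Padmesh1"], "salary": [500000.0, 500000.0]},
    index=[14, 15],
)

# concat stacks the frames row-wise and keeps each frame's index labels
combined = pd.concat([records, new_rows])
print(combined)
```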


In [91]:
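# Note: == is a comparison, so this returns a boolean Series and changes nothing.
# To set every value in the column, assign instead: temp_df.loc[:,'employee_id'] = 1000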
temp_df.loc[:,'employee_id'] == 1000

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
Name: employee_id, dtype: bool
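
The all-False result above is the outcome of a comparison. A minimal sketch of the difference between comparing and assigning, on an illustrative frame named `temp`:

```python
import pandas as pd

temp = pd.DataFrame({"employee_id": [10001, 10002, 10003]})

mask = temp.loc[:, "employee_id"] == 1000   # comparison: boolean Series
temp.loc[:, "employee_id"] = 1000           # assignment: overwrites the column

print(mask.tolist())                  # [False, False, False]
print(temp["employee_id"].tolist())   # [1000, 1000, 1000]
```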

In [92]:
# Selecting specific rows and columns by position using iloc
employee_records.iloc[[4,2],[1,3]]

Unnamed: 0,employee_dept,salary
4,Finance,58753.0
2,Marketing,53423.27


In [93]:
# Selecting specific rows and columns using loc: columns are given by label, not position
# employee_records.loc[[4,2],['employee_dept','salary']]

In [94]:
employee_records.iloc[0:4,:]

Unnamed: 0,employee_name,employee_dept,employee_id,salary
0,Sam,Research,10001,45034.88
1,Max,HR,10002,65343.45
2,Tony,Marketing,10003,53423.27
3,Sarah,Sales,10004,76422.34


# Iterating through a DataFrame

In [95]:
for indexx,row in employee_records.iterrows():
  print(indexx, row,'\n')

0 employee_name         Sam
employee_dept    Research
employee_id         10001
salary            45034.9
Name: 0, dtype: object 

1 employee_name        Max
employee_dept         HR
employee_id        10002
salary           65343.4
Name: 1, dtype: object 

2 employee_name         Tony
employee_dept    Marketing
employee_id          10003
salary             53423.3
Name: 2, dtype: object 

3 employee_name      Sarah
employee_dept      Sales
employee_id        10004
salary           76422.3
Name: 3, dtype: object 

4 employee_name      Tania
employee_dept    Finance
employee_id        10005
salary             58753
Name: 4, dtype: object 

5 employee_name      David
employee_dept         IT
employee_id        10006
salary           34323.4
Name: 5, dtype: object 

6 employee_name       Mark
employee_dept         HR
employee_id        10007
salary           66544.6
Name: 6, dtype: object 

7 employee_name        Alice
employee_dept    Marketing
employee_id          10008
salary          
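
`iterrows` yields each row as a Series. An alternative worth knowing is `itertuples`, which yields one namedtuple per row and is generally faster. The sketch below uses an illustrative frame named `staff`:

```python
import pandas as pd

staff = pd.DataFrame({"employee": ["Sam", "Max"], "salary": [45034.88, 65343.45]})

# iterrows yields (index label, row as a Series)
for label, row in staff.iterrows():
    print(label, row["employee"], row["salary"])

# itertuples yields one namedtuple per row; the first field is the index
names = [record.employee for record in staff.itertuples()]
print(names)    # ['Sam', 'Max']
```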

# Selecting data containing specific string

In [96]:
#Select employees whose name contains 'ar' --> Case sensitive
employee_records.loc[employee_records.employee_name.str.contains('ar')]
# employee_records.loc[employee_records['employee_name'].str.contains('ar')]

Unnamed: 0,employee_name,employee_dept,employee_id,salary
3,Sarah,Sales,10004,76422.34
6,Mark,HR,10007,66544.6
8,Charles,IT,10009,55234.96


In [97]:
#Select employees whose name contains 'Ar' --> no match, because str.contains is case sensitive by default
employee_records.loc[employee_records.employee_name.str.contains('Ar')]
# employee_records.loc[employee_records['employee_name'].str.contains('ar')]

Unnamed: 0,employee_name,employee_dept,employee_id,salary


# Regular Expression

In [98]:
employee_records.loc[employee_records.employee_dept.str.contains('ar|i')]

Unnamed: 0,employee_name,employee_dept,employee_id,salary
0,Sam,Research,10001,45034.88
2,Tony,Marketing,10003,53423.27
4,Tania,Finance,10005,58753.0
7,Alice,Marketing,10008,34354.66
9,Bob,Finance,10010,39078.6
