### Jupyter notebook commands


    Esc will take you into command mode where you can navigate around your notebook with arrow keys.
    While in command mode:
        A to insert a new cell above the current cell, 
        B to insert a new cell below.
        M to change the current cell to Markdown, 
        Y to change it back to code
        D + D (press the key twice) to delete the current cell
    Enter will take you from command mode back into edit mode for the given cell.
    Shift + Tab will show you the docstring (documentation) for the object you have just typed in a code cell; you can keep pressing this shortcut to cycle through a few modes of documentation.
    Ctrl + Shift + - will split the current cell into two from where your cursor is.
    Esc + F opens find and replace, which searches your code but not the outputs.
    Esc + O toggles the cell output.
    Select Multiple Cells:
        Shift + J or Shift + Down selects the next cell in a downwards direction. You can also select cells in an upwards direction by using Shift + K or Shift + Up.
        Once cells are selected, you can then delete / copy / cut / paste / run them as a batch. This is helpful when you need to move parts of a notebook.
        You can also use Shift + M to merge multiple cells.


![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

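### Pandas can work with several file formats:
Examples (formats such as images, PDF, DOCX, MP3 and MP4 have no built-in pandas reader; their contents must first be extracted with other libraries):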
    
    Comma-separated values (CSV)
    XLSX
    ZIP
    Plain Text (txt)
    JSON
    XML
    HTML
    Images
    Hierarchical Data Format
    PDF
    DOCX
    MP3
    MP4
    SQL
    
https://www.cbtnuggets.com/blog/technology/programming/14-file-types-you-can-import-into-pandas
https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

# Pandas provides two main data structures:
        Series: 1-D column of data.
        DataFrame: 2-D table with rows and columns.
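
The difference in dimensionality can be checked directly. This is a minimal sketch that assumes only that pandas is installed; the names `ages` and `people` are illustrative and not part of the notebook:

```python
import pandas as pd

# A Series is one labelled column of values.
ages = pd.Series([25, 26, 25], name="Age")

# A DataFrame is a table; each of its columns is itself a Series.
people = pd.DataFrame({"Name": ["Harry", "Potter", "Joey"], "Age": [25, 26, 25]})

print(ages.ndim, ages.shape)      # 1 (3,)
print(people.ndim, people.shape)  # 2 (3, 2)
print(type(people["Age"]))        # selecting one column gives back a Series
```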

![image.png](attachment:image.png)

## Importing Pandas

#### In Jupyter Notebook, use Shift+Tab to show a tooltip with arguments of function/class and docstring

In [46]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [2]:
pd.__version__ ## To check the version

'1.3.4'

## Creating pandas Series...

### A Series is created with the pd.Series constructor, where pd is the alias for pandas

In [3]:
'''Creating Series'''
L1=['eagle','flamingo','crow','dove']
birds = pd.Series(L1)

In [4]:
print(birds)

0       eagle
1    flamingo
2        crow
3        dove
dtype: object


In [5]:
'''Creating another Series'''
animals = pd.Series(['cat','dog','horse','elephant'])

In [6]:
animals

0         cat
1         dog
2       horse
3    elephant
dtype: object

### To check the datatype

In [7]:
'''Check the datatype'''
print(type(birds))
print(type(animals))

<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>


# DataFrame

![image.png](attachment:image.png)

## Creating a pandas DataFrame:
1. Prepare the data
2. Label the rows
3. Label the columns

**Default labels are assigned to rows and columns if they are not defined explicitly**

### 1. Using numpy array for data

In [8]:
Array = np.array([[1,1,1],[2,4,8],[3,9,27],
                 [4,16,64],[5,25,125],[6,36,216],
                 [7,49,343]])
print(Array)

[[  1   1   1]
 [  2   4   8]
 [  3   9  27]
 [  4  16  64]
 [  5  25 125]
 [  6  36 216]
 [  7  49 343]]


### Explicitly defining the rows and the columns:

In [10]:
### defining the row labels explicitly ###
number = ['first','second', 'third', 'fourth','fifth','sixth', 'seventh']
### defining the column labels explicitly
column_names = ['1', 'squares', 'cubes']

In [11]:
df_Array=pd.DataFrame(data = Array, index = number, columns=column_names)
df_Array

Unnamed: 0,1,squares,cubes
first,1,1,1
second,2,4,8
third,3,9,27
fourth,4,16,64
fifth,5,25,125
sixth,6,36,216
seventh,7,49,343


### Default labeling for rows and columns

In [13]:
df_Array1 = pd.DataFrame(Array)

In [14]:
df_Array1

Unnamed: 0,0,1,2
0,1,1,1
1,2,4,8
2,3,9,27
3,4,16,64
4,5,25,125
5,6,36,216
6,7,49,343


### By combining the existing Series in the form of a dictionary

In [15]:
'''Creating DataFrame using already created Series'''
df = pd.DataFrame({"birds_name":birds,
                  "animal_name":animals})  
### birds_name is the key
### birds is the value

In [16]:
df    # keys of dictionary taken as column names

Unnamed: 0,birds_name,animal_name
0,eagle,cat
1,flamingo,dog
2,crow,horse
3,dove,elephant


In [17]:
'''Check the datatype'''
print(type(df))

<class 'pandas.core.frame.DataFrame'>


### Saving Dataframe into .csv file...

In [18]:
'''Save the file as a_b.csv'''
df.to_csv('a_b.csv', sep=':')   # pass the separator by keyword; ':' is used instead of the default ','
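
A file written with a non-default separator has to be read back with the same separator. The sketch below is a hedged illustration of that round trip: it writes to a temporary directory rather than the working directory, and `df_demo` and `restored` are illustrative names that do not appear in the notebook.

```python
import os
import tempfile

import pandas as pd

df_demo = pd.DataFrame({"birds_name": ["eagle", "crow"],
                        "animal_name": ["cat", "horse"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "a_b.csv")
    # index=False leaves the 0, 1, ... row labels out of the file.
    df_demo.to_csv(path, sep=":", index=False)
    # The same separator must be given when reading the file back.
    restored = pd.read_csv(path, sep=":")

print(restored.equals(df_demo))
```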

#### Creating another DataFrame...

In [19]:
'''Let's create another DataFrame'''

items = pd.Series(['Milk','Bread','Butter','Sugar'])
price = pd.Series([50,45,30,60])
df_1 = pd.DataFrame({"Items": items,
                    "Price": price})

In [20]:
df_1

Unnamed: 0,Items,Price
0,Milk,50
1,Bread,45
2,Butter,30
3,Sugar,60


#### Creating another DataFrame using python Dictionary 

In [21]:
df_2 = pd.DataFrame({"Name":['A','B','C'],
                     "Age": [20,30,40]})

In [22]:
df_2

Unnamed: 0,Name,Age
0,A,20
1,B,30
2,C,40


## Importing Data File...
#### NOTE: Mention correctly the path where you have stored your datafile

In [25]:
pip install openpyxl     # required in order to use the read_excel function with .xlsx files
# pip install xlrd

Collecting openpyxl
  Downloading openpyxl-3.0.9-py2.py3-none-any.whl (242 kB)
Collecting et-xmlfile
  Downloading et_xmlfile-1.1.0-py3-none-any.whl (4.7 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-1.1.0 openpyxl-3.0.9
Note: you may need to restart the kernel to use updated packages.


In [59]:
sample = pd.read_excel("sample.xlsx") 
print(type(sample))

<class 'pandas.core.frame.DataFrame'>


In [27]:
df = pd.read_excel('sample.xlsx',index_col=0) # index_col=0 uses the first column (Name) as the index

In [28]:
df

Name,Age,Gender,Salary
Harry,25,M,10000
Potter,26,M,20000
Joey,25,M,30000
Ronny,29,M,50000
John,28,M,45000
Monica,21,F,35000
Nick,30,M,28000
Karen,35,F,31000
Susan,30,F,32000
Emma,36,F,52000


In [57]:
df = pd.read_excel('sample.xlsx')   # if no index_col specified
df

Unnamed: 0,Name,Age,Gender,Salary
0,Harry,25,M,10000
1,Potter,26,M,20000
2,Joey,25,M,30000
3,Ronny,29,M,50000
4,John,28,M,45000
5,Monica,21,F,35000
6,Nick,30,M,28000
7,Karen,35,F,31000
8,Susan,30,F,32000
9,Emma,36,F,52000
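
The effect of `index_col` can be reproduced without an Excel file. The sketch below is an assumption-laden stand-in: it uses `pd.read_csv` on in-memory text (the first three rows of the data above) instead of `pd.read_excel`; both readers accept `index_col`. The names `csv_text`, `by_position` and `by_name` are illustrative.

```python
from io import StringIO

import pandas as pd

# Stand-in for sample.xlsx: the first three rows, written as CSV text.
csv_text = (
    "Name,Age,Gender,Salary\n"
    "Harry,25,M,10000\n"
    "Potter,26,M,20000\n"
    "Joey,25,M,30000\n"
)

by_position = pd.read_csv(StringIO(csv_text))           # default index 0, 1, 2
by_name = pd.read_csv(StringIO(csv_text), index_col=0)  # first column becomes the index

print(by_position.index.tolist())     # [0, 1, 2]
print(by_name.index.tolist())         # ['Harry', 'Potter', 'Joey']
print(by_name.loc["Potter", "Age"])   # 26
```

With the names as the index, rows can be looked up by name through `.loc` instead of by number.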


In [31]:
sample        

Unnamed: 0,Name,Age,Gender,Salary
0,Harry,25,M,10000
1,Potter,26,M,20000
2,Joey,25,M,30000
3,Ronny,29,M,50000
4,John,28,M,45000
5,Monica,21,F,35000
6,Nick,30,M,28000
7,Karen,35,F,31000
8,Susan,30,F,32000
9,Emma,36,F,52000


In [32]:
print(type(sample))

<class 'pandas.core.frame.DataFrame'>


In [33]:
print(type(sample["Name"]))

<class 'pandas.core.series.Series'>


In [34]:
print(sample['Age'])

0     25
1     26
2     25
3     29
4     28
5     21
6     30
7     35
8     30
9     36
10    29
11    23
12    21
13    33
14    34
Name: Age, dtype: int64


## Useful functions of Pandas

![image.png](attachment:image.png)

In [36]:
sample.head()

Unnamed: 0,Name,Age,Gender,Salary
0,Harry,25,M,10000
1,Potter,26,M,20000
2,Joey,25,M,30000
3,Ronny,29,M,50000
4,John,28,M,45000


In [37]:
sample.head(12)

Unnamed: 0,Name,Age,Gender,Salary
0,Harry,25,M,10000
1,Potter,26,M,20000
2,Joey,25,M,30000
3,Ronny,29,M,50000
4,John,28,M,45000
5,Monica,21,F,35000
6,Nick,30,M,28000
7,Karen,35,F,31000
8,Susan,30,F,32000
9,Emma,36,F,52000


![image.png](attachment:image.png)

In [38]:
sample.tail()

Unnamed: 0,Name,Age,Gender,Salary
10,Linda,29,F,55000
11,Jack,23,M,25000
12,Emily,21,F,41000
13,Noah,33,M,42000
14,Jacob,34,M,55000


In [39]:
sample.tail(2)

Unnamed: 0,Name,Age,Gender,Salary
13,Noah,33,M,42000
14,Jacob,34,M,55000


### To extract rows at random from the dataset, use the method .sample()

In [40]:
sample.sample(2)

Unnamed: 0,Name,Age,Gender,Salary
10,Linda,29,F,55000
13,Noah,33,M,42000


In [41]:
sample.sample(5)

Unnamed: 0,Name,Age,Gender,Salary
7,Karen,35,F,31000
10,Linda,29,F,55000
1,Potter,26,M,20000
3,Ronny,29,M,50000
13,Noah,33,M,42000
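
Because `.sample()` is random, each run returns different rows. A minimal sketch of how to make the draw repeatable with the `random_state` parameter, using a small illustrative frame called `demo` rather than the notebook's data:

```python
import pandas as pd

demo = pd.DataFrame({"Name": ["Harry", "Potter", "Joey", "Ronny", "John"],
                     "Age": [25, 26, 25, 29, 28]})

# The same random_state always selects the same rows.
first_draw = demo.sample(n=2, random_state=42)
second_draw = demo.sample(n=2, random_state=42)
print(first_draw.equals(second_draw))

# frac draws a fraction of the rows instead of a fixed number.
sixty_percent = demo.sample(frac=0.6, random_state=0)
print(len(sixty_percent))
```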


## Useful functions to get basic information about the data
+ .describe() - gives basic descriptive statistics of the numeric columns
+ .info() - gives the data type and the number of non-null values of each column
+ .mean() - mean of each column
+ .sum() - sum of each column

### .describe() function

In [42]:
sample.describe()     # gives basic descriptive statistics of the numeric columns

Unnamed: 0,Age,Salary
count,15.0,15.0
mean,28.333333,36733.333333
std,4.835385,13370.89946
min,21.0,10000.0
25%,25.0,29000.0
50%,29.0,35000.0
75%,31.5,47500.0
max,36.0,55000.0


### .info()

In [43]:
sample.info()             # info() to check null values in dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    15 non-null     object
 1   Age     15 non-null     int64 
 2   Gender  15 non-null     object
 3   Salary  15 non-null     int64 
dtypes: int64(2), object(2)
memory usage: 608.0+ bytes


### .mean()

In [47]:
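sample.mean(numeric_only=True)   # restrict to numeric columns; newer pandas raises an error on the string columns otherwise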

Age          28.333333
Salary    36733.333333
dtype: float64

### .sum()

In [45]:
sample.sum()

Name      HarryPotterJoeyRonnyJohnMonicaNickKarenSusanEm...
Age                                                     425
Gender                                      MMMMMFMFFFFMFMM
Salary                                               551000
dtype: object

### Finding the sum, mean and standard deviation of a particular column (by name): sum(), mean(), std()

In [48]:
print("Total age of all the employees is:", sample["Age"].sum())
print("Average age is:", sample["Age"].mean())
print("Standard deviation in age is:", sample["Age"].std())

Total age of all the employees is: 425
Average age is: 28.333333333333332
Standard deviation in age is: 4.835385442852759
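
The three statistics can also be computed in one call with `.agg()`. This is a sketch on a small illustrative frame (`demo`, holding the first four rows of the Age and Salary columns), not the notebook's own cell:

```python
import pandas as pd

demo = pd.DataFrame({"Age": [25, 26, 25, 29],
                     "Salary": [10000, 20000, 30000, 50000]})

# One row per statistic, one column per selected column.
summary = demo[["Age", "Salary"]].agg(["sum", "mean", "std"])
print(summary)
```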


### Getting information about columns (attributes) through the columns attribute

In [49]:
sample.columns

Index(['Name', 'Age', 'Gender', 'Salary'], dtype='object')

In [50]:
sample.columns[2]

'Gender'

### Getting information about rows (observations) through the index attribute

In [51]:
sample.index

RangeIndex(start=0, stop=15, step=1)

### len will give the total number of observations or samples

In [52]:
len(sample)

15

### Displaying the first five rows again

In [60]:
sample.head()

Unnamed: 0,Name,Age,Gender,Salary
0,Harry,25,M,10000
1,Potter,26,M,20000
2,Joey,25,M,30000
3,Ronny,29,M,50000
4,John,28,M,45000


### Changing a specific value in the dataframe
    Suppose that we need to replace the name Potter with Henery. This can be done by mentioning the column name and the row index

In [61]:
'''Change the value at specific row and column...'''
sample.loc[1, "Name"] = "Henery"   # row label 1, column "Name"; avoids chained assignment
sample.head()

Unnamed: 0,Name,Age,Gender,Salary
0,Harry,25,M,10000
1,Henery,26,M,20000
2,Joey,25,M,30000
3,Ronny,29,M,50000
4,John,28,M,45000


### Rearranging the dataframe:
* With axis = 0, the rows are sorted by their index labels
* With axis = 1, the columns are sorted by their column labels
* The parameter ascending = False sorts in descending order

In [62]:
sample.sort_index(axis = 0, ascending = False) 

Unnamed: 0,Name,Age,Gender,Salary
14,Jacob,34,M,55000
13,Noah,33,M,42000
12,Emily,21,F,41000
11,Jack,23,M,25000
10,Linda,29,F,55000
9,Emma,36,F,52000
8,Susan,30,F,32000
7,Karen,35,F,31000
6,Nick,30,M,28000
5,Monica,21,F,35000


In [63]:
sample.sort_index(axis = 0, ascending = True)

Unnamed: 0,Name,Age,Gender,Salary
0,Harry,25,M,10000
1,Henery,26,M,20000
2,Joey,25,M,30000
3,Ronny,29,M,50000
4,John,28,M,45000
5,Monica,21,F,35000
6,Nick,30,M,28000
7,Karen,35,F,31000
8,Susan,30,F,32000
9,Emma,36,F,52000
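
The cells above only sort along axis = 0. A minimal sketch of both axes, on an illustrative two-row frame `demo` whose row labels are deliberately out of order:

```python
import pandas as pd

demo = pd.DataFrame({"Name": ["Harry", "Potter"],
                     "Age": [25, 26],
                     "Gender": ["M", "M"]},
                    index=[1, 0])

by_row_label = demo.sort_index(axis=0)      # rows reordered by label: 0, 1
by_column_label = demo.sort_index(axis=1)   # columns reordered by name: Age, Gender, Name

print(by_row_label)
print(by_column_label)
```

Both calls return a new DataFrame; `demo` keeps its original order.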


### Displaying the dataframe as transpose

In [64]:
sample.T     # rows to columns and columns to rows

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
Name,Harry,Henery,Joey,Ronny,John,Monica,Nick,Karen,Susan,Emma,Linda,Jack,Emily,Noah,Jacob
Age,25,26,25,29,28,21,30,35,30,36,29,23,21,33,34
Gender,M,M,M,M,M,F,M,F,F,F,F,M,F,M,M
Salary,10000,20000,30000,50000,45000,35000,28000,31000,32000,52000,55000,25000,41000,42000,55000


## .loc[ ] and .iloc[ ] for accessing the elements of a DataFrame
### loc refers to the label
### iloc refers to the position 

### Understanding .loc[] and .iloc[] in a Pandas series

In [65]:
'''Let's Create a sample series'''
animals = pd.Series(["Cat", "Dog", "Rabbit", "Ox", "Lion"], 
                   index=[3, 5, 9, 8, 2])

#### In the above example, the index label of Cat is 3, of Dog is 5, of Rabbit is 9, and so on.

In [66]:
animals #### Displaying the series

3       Cat
5       Dog
9    Rabbit
8        Ox
2      Lion
dtype: object

### If we want to access an element with index label 3, we make use of .loc[]

In [67]:
'''.loc[] refers to the label'''
animals.loc[3]

'Cat'

### If we want to access the element at position 3 (the fourth element, since positions start at 0), we make use of .iloc[]

In [68]:
'''iloc[] refers to the position'''
animals.iloc[3]

'Ox'

## loc and iloc in a dataframe

In [69]:
sample

Unnamed: 0,Name,Age,Gender,Salary
0,Harry,25,M,10000
1,Henery,26,M,20000
2,Joey,25,M,30000
3,Ronny,29,M,50000
4,John,28,M,45000
5,Monica,21,F,35000
6,Nick,30,M,28000
7,Karen,35,F,31000
8,Susan,30,F,32000
9,Emma,36,F,52000


### .loc[] refers to the index label and .iloc[] to the position (with the default index 0, 1, 2, ... both coincide)

In [70]:
sample.loc[3]

Name      Ronny
Age          29
Gender        M
Salary    50000
Name: 3, dtype: object

In [71]:
sample.iloc[7]

Name      Karen
Age          35
Gender        F
Salary    31000
Name: 7, dtype: object

### Using the slicing operator

In [73]:
sample.loc[2:5]     # label-based slicing includes both endpoints: labels 2, 3, 4 and 5

Unnamed: 0,Name,Age,Gender,Salary
2,Joey,25,M,30000
3,Ronny,29,M,50000
4,John,28,M,45000
5,Monica,21,F,35000


In [74]:
sample.iloc[0:4,0:2]

Unnamed: 0,Name,Age
0,Harry,25
1,Henery,26
2,Joey,25
3,Ronny,29


In [75]:
sample.iloc[2:5]

Unnamed: 0,Name,Age,Gender,Salary
2,Joey,25,M,30000
3,Ronny,29,M,50000
4,John,28,M,45000


In [76]:
sample.loc[0:2]['Age']

0    25
1    26
2    25
Name: Age, dtype: int64

In [77]:
sample.loc[2:5]["Age"]

2    25
3    29
4    28
5    21
Name: Age, dtype: int64
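
With the default integer index, labels and positions look alike, which hides the difference between the two slicing rules. The sketch below uses string labels to separate them; the Series `letters` is an illustrative example, not data from the notebook.

```python
import pandas as pd

letters = pd.Series([10, 20, 30, 40, 50], index=["a", "b", "c", "d", "e"])

# Label slice: from label "b" to label "d", both endpoints included.
by_label = letters.loc["b":"d"]

# Position slice: positions 1 and 2 only; the stop position 3 is excluded.
by_position = letters.iloc[1:3]

print(by_label.tolist())      # [20, 30, 40]
print(by_position.tolist())   # [20, 30]
```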

In [78]:
sample.head()

Unnamed: 0,Name,Age,Gender,Salary
0,Harry,25,M,10000
1,Henery,26,M,20000
2,Joey,25,M,30000
3,Ronny,29,M,50000
4,John,28,M,45000


### .loc[] and .iloc[] for assigning the value

In [79]:
sample.loc[1,'Age'] = 40
sample.head()

Unnamed: 0,Name,Age,Gender,Salary
0,Harry,25,M,10000
1,Henery,40,M,20000
2,Joey,25,M,30000
3,Ronny,29,M,50000
4,John,28,M,45000


In [80]:
sample.iloc[1,1] = 25
sample.head()

Unnamed: 0,Name,Age,Gender,Salary
0,Harry,25,M,10000
1,Henery,25,M,20000
2,Joey,25,M,30000
3,Ronny,29,M,50000
4,John,28,M,45000


### Creating a new column with the same value through out the DataFrame

In [81]:
sample["Occupation"] = "Employee"

In [82]:
sample

Unnamed: 0,Name,Age,Gender,Salary,Occupation
0,Harry,25,M,10000,Employee
1,Henery,25,M,20000,Employee
2,Joey,25,M,30000,Employee
3,Ronny,29,M,50000,Employee
4,John,28,M,45000,Employee
5,Monica,21,F,35000,Employee
6,Nick,30,M,28000,Employee
7,Karen,35,F,31000,Employee
8,Susan,30,F,32000,Employee
9,Emma,36,F,52000,Employee


## Creating a new column using data of other columns of the DataFrame

In [83]:
sample["Col"]  = sample["Age"] / sample["Salary"]

In [84]:
sample

Unnamed: 0,Name,Age,Gender,Salary,Occupation,Col
0,Harry,25,M,10000,Employee,0.0025
1,Henery,25,M,20000,Employee,0.00125
2,Joey,25,M,30000,Employee,0.000833
3,Ronny,29,M,50000,Employee,0.00058
4,John,28,M,45000,Employee,0.000622
5,Monica,21,F,35000,Employee,0.0006
6,Nick,30,M,28000,Employee,0.001071
7,Karen,35,F,31000,Employee,0.001129
8,Susan,30,F,32000,Employee,0.000937
9,Emma,36,F,52000,Employee,0.000692
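
A derived column can also be added without modifying the original frame by using `.assign()`. In this sketch the column name `AnnualSalary` and the assumption that Salary is monthly are hypothetical, chosen only for illustration:

```python
import pandas as pd

demo = pd.DataFrame({"Name": ["Harry", "Potter"],
                     "Salary": [10000, 20000]})

# assign() returns a new DataFrame; demo itself keeps its two columns.
with_annual = demo.assign(AnnualSalary=demo["Salary"] * 12)

print(with_annual)
print(demo.columns.tolist())
```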


### Drop the column from DataFrame...

In [85]:
'''Dropping the column named "Col"...axis = 1 specifies the column'''
sample = sample.drop("Col",axis = 1)

In [86]:
sample

Unnamed: 0,Name,Age,Gender,Salary,Occupation
0,Harry,25,M,10000,Employee
1,Henery,25,M,20000,Employee
2,Joey,25,M,30000,Employee
3,Ronny,29,M,50000,Employee
4,John,28,M,45000,Employee
5,Monica,21,F,35000,Employee
6,Nick,30,M,28000,Employee
7,Karen,35,F,31000,Employee
8,Susan,30,F,32000,Employee
9,Emma,36,F,52000,Employee


### Dropping the row labeled 3...axis = 0 specifies the row

In [87]:
sample = sample.drop(3 ,axis = 0)   # label 4 now follows label 2 in the result, since row 3 is deleted

In [88]:
sample

Unnamed: 0,Name,Age,Gender,Salary,Occupation
0,Harry,25,M,10000,Employee
1,Henery,25,M,20000,Employee
2,Joey,25,M,30000,Employee
4,John,28,M,45000,Employee
5,Monica,21,F,35000,Employee
6,Nick,30,M,28000,Employee
7,Karen,35,F,31000,Employee
8,Susan,30,F,32000,Employee
9,Emma,36,F,52000,Employee
10,Linda,29,F,55000,Employee


### Appending a new row into the DataFrame
#### Let's take two new DataFrames as an example...

In [4]:
'''Create a new DataFrame'''

import pandas as pd
df_1 = pd.DataFrame({"A":[1, 2, 3], 
                    "B":[5, 6, 7]}) 

'''Create another DataFrame to append'''

df_2 = pd.DataFrame({"A":[10, 20, 30, 40], 
                    "B":[50, 60, 70, 80]}) 

In [5]:
print(df_1)
print(df_2)


   A  B
0  1  5
1  2  6
2  3  7
    A   B
0  10  50
1  20  60
2  30  70
3  40  80


In [6]:
df_3 = pd.concat([df_1, df_2])   # pd.concat replaces DataFrame.append(), which was removed in pandas 2.0

In [7]:
df_3

Unnamed: 0,A,B
0,1,5
1,2,6
2,3,7
0,10,50
1,20,60
2,30,70
3,40,80


#### If you look at the index of df_3, you can see that appending two DataFrames preserves their original index labels, so labels are repeated. To get a fresh index 0, 1, 2, ... instead, set ignore_index = True

In [8]:
df_3 = pd.concat([df_1, df_2], ignore_index=True)   # pd.concat replaces DataFrame.append(), which was removed in pandas 2.0

In [9]:
df_3

Unnamed: 0,A,B
0,1,5
1,2,6
2,3,7
3,10,50
4,20,60
5,30,70
6,40,80


In [10]:
df_3.drop(3,axis=0) ### drops the row labeled 3; the remaining labels keep a gap (0, 1, 2, 4, 5, 6)

Unnamed: 0,A,B
0,1,5
1,2,6
2,3,7
4,20,60
5,30,70
6,40,80
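
After a drop, the gap in the labels can be closed with `reset_index(drop=True)`. A minimal sketch on an illustrative frame `df_demo`:

```python
import pandas as pd

df_demo = pd.DataFrame({"A": [1, 2, 3, 10], "B": [5, 6, 7, 50]})

after_drop = df_demo.drop(2, axis=0)            # labels are now 0, 1, 3
renumbered = after_drop.reset_index(drop=True)  # labels are 0, 1, 2 again

print(after_drop.index.tolist())
print(renumbered.index.tolist())
```

Passing `drop=True` discards the old labels; without it they are kept as an extra column named `index`.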


## Appending DataFrames with different shapes

In [13]:
df_4 = pd.DataFrame({"A":[1, 2, 3, 4],       ## keys will be taken as columns
                    "B":[5, 6, 7, 8]}) 
print(df_4)
'''Create another DataFrame to append'''
df_5 = pd.DataFrame({"A":[10, 20, 30], 
                    "B":[50, 60, 70],
                    "C":[80,90,100]}) 
print(df_5)

   A  B
0  1  5
1  2  6
2  3  7
3  4  8
    A   B    C
0  10  50   80
1  20  60   90
2  30  70  100


In [14]:
df_6 = pd.concat([df_4, df_5], ignore_index=True)   # pd.concat replaces DataFrame.append(), which was removed in pandas 2.0

### For unequal number of columns in the data frame, non-existent values will be filled with NaN values.

In [15]:
df_6

Unnamed: 0,A,B,C
0,1,5,
1,2,6,
2,3,7,
3,4,8,
4,10,50,80.0
5,20,60,90.0
6,30,70,100.0
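
Filling with NaN is the default behaviour when the columns differ. If only the shared columns are wanted, `pd.concat` accepts `join="inner"`. The frames `df_a` and `df_b` below are illustrative stand-ins:

```python
import pandas as pd

df_a = pd.DataFrame({"A": [1, 2], "B": [5, 6]})
df_b = pd.DataFrame({"A": [10, 20], "B": [50, 60], "C": [80, 90]})

# Default join="outer": union of the columns, gaps filled with NaN.
outer = pd.concat([df_a, df_b], ignore_index=True)

# join="inner": only the columns present in every frame are kept.
inner = pd.concat([df_a, df_b], ignore_index=True, join="inner")

print(outer)
print(inner)
```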


## Dealing with Missing Values...

In [16]:
'''See if there is any NaN value'''
df_6.isna()

Unnamed: 0,A,B,C
0,False,False,True
1,False,False,True
2,False,False,True
3,False,False,True
4,False,False,False
5,False,False,False
6,False,False,False


#### Count the number of missing values

In [17]:
df_6.isna().sum()

A    0
B    0
C    4
dtype: int64

### Imputation of missing values

In [22]:
'''Let's use the .fillna() function to fill the C column with the mean of the other values in the same column...

The original df_6 remains unchanged because of inplace = False: the filled Series is returned and, here, discarded.
To keep the filled values, assign the result back to df_6["C"]...'''

df_6["C"].fillna(df_6["C"].mean(),inplace=False)
print(df_6)

    A   B      C
0   1   5    NaN
1   2   6    NaN
2   3   7    NaN
3   4   8    NaN
4  10  50   80.0
5  20  60   90.0
6  30  70  100.0
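
To make the imputation persist, the returned Series can be assigned back to the column. A minimal sketch on an illustrative frame `df_demo`:

```python
import pandas as pd

nan = float("nan")
df_demo = pd.DataFrame({"A": [1, 2, 10, 20], "C": [nan, nan, 80.0, 90.0]})

column_mean = df_demo["C"].mean()   # NaN values are skipped: (80 + 90) / 2
df_demo["C"] = df_demo["C"].fillna(column_mean)

print(df_demo)
```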


#### Drop the rows with missing values... by default inplace=False

In [19]:
df_6.dropna()    # drops every row that contains at least one NaN value


Unnamed: 0,A,B,C
4,10,50,80.0
5,20,60,90.0
6,30,70,100.0
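
`dropna()` can also work column-wise or look at selected columns only. A short sketch of the `axis` and `subset` parameters, using an illustrative frame `df_demo`:

```python
import pandas as pd

nan = float("nan")
df_demo = pd.DataFrame({"A": [1, 2, 3],
                        "B": [5, nan, 7],
                        "C": [nan, nan, 100.0]})

rows_kept = df_demo.dropna()                    # only row 2 has no NaN at all
columns_kept = df_demo.dropna(axis=1)           # only column A has no NaN at all
checked_b_only = df_demo.dropna(subset=["B"])   # NaN outside column B is ignored

print(rows_kept.index.tolist())
print(columns_kept.columns.tolist())
print(checked_b_only.index.tolist())
```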


In [23]:
df_7 = df_6   # df_7 is a second name for the same DataFrame, not a copy (use df_6.copy() for a copy)
df_7['C'].dropna() 

4     80.0
5     90.0
6    100.0
Name: C, dtype: float64

In [112]:
'''Appending a new row in the sample DataFrame'''
s = pd.Series({'Name':'Harry','Age':30, 'Gender':'M','Salary':32000})

In [113]:
sample = pd.concat([sample, s.to_frame().T.infer_objects()], ignore_index = True)   # DataFrame.append() was removed in pandas 2.0

In [114]:
sample

Unnamed: 0,Name,Age,Gender,Salary,Occupation
0,Harry,25,M,10000,Employee
1,Henery,25,M,20000,Employee
2,Joey,25,M,30000,Employee
3,John,28,M,45000,Employee
4,Monica,21,F,35000,Employee
5,Nick,30,M,28000,Employee
6,Karen,35,F,31000,Employee
7,Susan,30,F,32000,Employee
8,Emma,36,F,52000,Employee
9,Linda,29,F,55000,Employee


In [115]:
sample['Occupation'] = sample['Occupation'].fillna('Employee')   # assign back instead of chained inplace = True

In [116]:
sample

Unnamed: 0,Name,Age,Gender,Salary,Occupation
0,Harry,25,M,10000,Employee
1,Henery,25,M,20000,Employee
2,Joey,25,M,30000,Employee
3,John,28,M,45000,Employee
4,Monica,21,F,35000,Employee
5,Nick,30,M,28000,Employee
6,Karen,35,F,31000,Employee
7,Susan,30,F,32000,Employee
8,Emma,36,F,52000,Employee
9,Linda,29,F,55000,Employee


### Queries over the DataFrame

In [117]:
sample.loc[(sample['Age']>25)]

Unnamed: 0,Name,Age,Gender,Salary,Occupation
3,John,28,M,45000,Employee
5,Nick,30,M,28000,Employee
6,Karen,35,F,31000,Employee
7,Susan,30,F,32000,Employee
8,Emma,36,F,52000,Employee
9,Linda,29,F,55000,Employee
12,Noah,33,M,42000,Employee
13,Jacob,34,M,55000,Employee
14,Harry,30,M,32000,Employee
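
Several conditions can be combined inside `.loc[]` with `&` (and) and `|` (or). This is a sketch on an illustrative four-row frame `demo`:

```python
import pandas as pd

demo = pd.DataFrame({"Name": ["Harry", "Ronny", "Monica", "Karen"],
                     "Age": [25, 29, 21, 35],
                     "Gender": ["M", "M", "F", "F"]})

# Each comparison needs its own parentheses; & means "and", | means "or".
older_men = demo.loc[(demo["Age"] > 25) & (demo["Gender"] == "M")]
young_or_female = demo.loc[(demo["Age"] < 25) | (demo["Gender"] == "F")]

print(older_men["Name"].tolist())         # ['Ronny']
print(young_or_female["Name"].tolist())   # ['Monica', 'Karen']
```

The Python keywords `and` and `or` do not work on whole columns here; the element-wise operators `&` and `|` are required.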


In [118]:
sample.groupby(['Gender']).mean(numeric_only=True)   # groupby() is used in EDA to summarise the data per group

Gender,Age,Salary
F,28.666667,41000.0
M,28.111111,31888.888889
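
A group-by can compute a different statistic per column in a single call. The sketch below uses named aggregation; the result column names `mean_age`, `max_salary` and `headcount` are illustrative choices, and `demo` is a small stand-in for the data above.

```python
import pandas as pd

demo = pd.DataFrame({"Gender": ["M", "M", "F", "F"],
                     "Age": [25, 29, 21, 35],
                     "Salary": [10000, 50000, 35000, 31000]})

# Named aggregation: new_column=(source_column, function).
per_gender = demo.groupby("Gender").agg(
    mean_age=("Age", "mean"),
    max_salary=("Salary", "max"),
    headcount=("Age", "count"),
)

print(per_gender)
```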


#### Rename the column named "Occupation" to "Designation" (with inplace = False a renamed copy is returned and sample stays unchanged)

In [119]:
sample.rename(columns = {'Occupation':'Designation'}, inplace = False)

Unnamed: 0,Name,Age,Gender,Salary,Designation
0,Harry,25,M,10000,Employee
1,Henery,25,M,20000,Employee
2,Joey,25,M,30000,Employee
3,John,28,M,45000,Employee
4,Monica,21,F,35000,Employee
5,Nick,30,M,28000,Employee
6,Karen,35,F,31000,Employee
7,Susan,30,F,32000,Employee
8,Emma,36,F,52000,Employee
9,Linda,29,F,55000,Employee


In [120]:
sample

Unnamed: 0,Name,Age,Gender,Salary,Occupation
0,Harry,25,M,10000,Employee
1,Henery,25,M,20000,Employee
2,Joey,25,M,30000,Employee
3,John,28,M,45000,Employee
4,Monica,21,F,35000,Employee
5,Nick,30,M,28000,Employee
6,Karen,35,F,31000,Employee
7,Susan,30,F,32000,Employee
8,Emma,36,F,52000,Employee
9,Linda,29,F,55000,Employee


In [121]:
# General form: df.sort_values(by=['column_name'], inplace=True) sorts df itself

sample.sort_values('Name')   # sorting works on string columns as well as numeric ones

Unnamed: 0,Name,Age,Gender,Salary,Occupation
11,Emily,21,F,41000,Employee
8,Emma,36,F,52000,Employee
0,Harry,25,M,10000,Employee
14,Harry,30,M,32000,Employee
1,Henery,25,M,20000,Employee
10,Jack,23,M,25000,Employee
13,Jacob,34,M,55000,Employee
2,Joey,25,M,30000,Employee
3,John,28,M,45000,Employee
6,Karen,35,F,31000,Employee


In [122]:
sample

Unnamed: 0,Name,Age,Gender,Salary,Occupation
0,Harry,25,M,10000,Employee
1,Henery,25,M,20000,Employee
2,Joey,25,M,30000,Employee
3,John,28,M,45000,Employee
4,Monica,21,F,35000,Employee
5,Nick,30,M,28000,Employee
6,Karen,35,F,31000,Employee
7,Susan,30,F,32000,Employee
8,Emma,36,F,52000,Employee
9,Linda,29,F,55000,Employee


In [123]:
sample.sort_values('Age')

Unnamed: 0,Name,Age,Gender,Salary,Occupation
4,Monica,21,F,35000,Employee
11,Emily,21,F,41000,Employee
10,Jack,23,M,25000,Employee
0,Harry,25,M,10000,Employee
1,Henery,25,M,20000,Employee
2,Joey,25,M,30000,Employee
3,John,28,M,45000,Employee
9,Linda,29,F,55000,Employee
5,Nick,30,M,28000,Employee
7,Susan,30,F,32000,Employee


In [124]:
sample.sort_values( by = ['Age', 'Name'])

Unnamed: 0,Name,Age,Gender,Salary,Occupation
11,Emily,21,F,41000,Employee
4,Monica,21,F,35000,Employee
10,Jack,23,M,25000,Employee
0,Harry,25,M,10000,Employee
1,Henery,25,M,20000,Employee
2,Joey,25,M,30000,Employee
3,John,28,M,45000,Employee
9,Linda,29,F,55000,Employee
14,Harry,30,M,32000,Employee
5,Nick,30,M,28000,Employee
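
When sorting by several columns, each column can get its own direction by passing a list to `ascending`. A minimal sketch on an illustrative frame `demo`:

```python
import pandas as pd

demo = pd.DataFrame({"Name": ["Monica", "Emily", "Harry", "Nick"],
                     "Age": [21, 21, 30, 30]})

# Oldest first; within the same age, names in alphabetical order.
ordered = demo.sort_values(by=["Age", "Name"], ascending=[False, True])

print(ordered["Name"].tolist())   # ['Harry', 'Nick', 'Emily', 'Monica']
```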


## Write your SQL-like queries in pandas
### Pandas query: DataFrame.query() selects the rows that satisfy a condition
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html

#### To get the data of the employees with Age strictly between 25 and 30

In [125]:
sample.query('Age > 25 and Age < 30')

Unnamed: 0,Name,Age,Gender,Salary,Occupation
3,John,28,M,45000,Employee
9,Linda,29,F,55000,Employee


#### To get the Age of the Employee named Monica

In [126]:
sample.query("Name == 'Monica'")['Age']  # select the rows where Name is Monica, then take the Age column

4    21
Name: Age, dtype: int64

#### To get the Age and Occupation of the Employee named Monica

In [127]:
sample.query("Name == 'Monica'")[['Age' , 'Occupation']]

Unnamed: 0,Age,Occupation
4,21,Employee


### AND operator
* Get the data of the employees having Age > 30 and Gender == 'M'

In [128]:
sample.query("(Age > 30) and (Gender =='M') ")

Unnamed: 0,Name,Age,Gender,Salary,Occupation
12,Noah,33,M,42000,Employee
13,Jacob,34,M,55000,Employee


### OR operator
* Get the data of the Employee either having Age == 30 or Gender == Male

In [129]:
'''OR operator'''
# Get the data of the employees having either Age == 30 or Gender == 'M'
sample.query("(Age == 30) or (Gender =='M') ")

Unnamed: 0,Name,Age,Gender,Salary,Occupation
0,Harry,25,M,10000,Employee
1,Henery,25,M,20000,Employee
2,Joey,25,M,30000,Employee
3,John,28,M,45000,Employee
5,Nick,30,M,28000,Employee
7,Susan,30,F,32000,Employee
10,Jack,23,M,25000,Employee
12,Noah,33,M,42000,Employee
13,Jacob,34,M,55000,Employee
14,Harry,30,M,32000,Employee
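
Query strings can also refer to ordinary Python variables by prefixing the name with `@`. In this sketch the thresholds `min_age` and `max_age` and the frame `demo` are hypothetical names chosen for illustration:

```python
import pandas as pd

demo = pd.DataFrame({"Name": ["Harry", "Ronny", "Monica", "Karen"],
                     "Age": [25, 29, 21, 35]})

min_age = 24   # hypothetical thresholds held in ordinary Python variables
max_age = 30

# @name inside the query string refers to the Python variable of that name.
in_range = demo.query("Age >= @min_age and Age <= @max_age")

print(in_range["Name"].tolist())   # ['Harry', 'Ronny']
```

This avoids building the query string by hand when the limits come from user input or earlier computations.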


## References

1. https://pandas.pydata.org/docs/index.html
2. https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html 
3. https://medium.com/jbennetcodes/how-to-rewrite-your-sql-queries-in-pandas-and-more-149d341fc53e  - For writing sql queries in Pandas
4. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython by Wes McKinney, 2016 (reference book)

### Pandas is very efficient with small data (usually from 100 MB up to 1 GB) and performance is rarely a concern

#### With Dask, you can work with datasets that are much larger than memory, as long as each partition (a regular pandas DataFrame) fits in memory. By default, dask.dataframe operations use a thread pool to run in parallel.

##### Dask for big data - Dask is a flexible library for parallel computing in Python. Dask is composed of two parts: dynamic task scheduling optimized for computation, and "big data" collections such as parallel arrays, dataframes, and lists.
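
The idea of processing data one partition at a time can be sketched with plain pandas as well. The example below does not use Dask: it relies on the `chunksize` parameter of `pd.read_csv`, and the in-memory text `csv_text` is only a stand-in for a file too large to load at once.

```python
from io import StringIO

import pandas as pd

# Stand-in for a large file on disk.
csv_text = "Salary\n10000\n20000\n30000\n50000\n45000\n"

total = 0
rows = 0
# chunksize=2 yields ordinary DataFrames of at most two rows each.
for chunk in pd.read_csv(StringIO(csv_text), chunksize=2):
    total += chunk["Salary"].sum()
    rows += len(chunk)

print(total / rows)   # mean salary computed without holding all rows at once
```

Only running totals are kept between chunks, so memory use depends on the chunk size rather than on the size of the file.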