![pandas](https://i.redd.it/c6h7rok9c2v31.jpg)
*By Shreeyansh Das, Source: gfg, Pandas Documentation*

Pandas is an open-source Python library built on top of NumPy. It offers data structures and operations for manipulating numerical data and time series, and it is popular mainly because it makes importing and analyzing data much easier. Pandas is fast and offers high performance and productivity for its users.

**Advantages** 
* Fast and efficient for manipulating and analyzing data. 
* Data from different file objects can be loaded. 
* Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data 
* *Size mutability*: columns can be inserted and deleted from DataFrame and higher dimensional objects 
* Data set merging and joining. 
* Flexible reshaping and pivoting of data sets 
* Provides time-series functionality. 
* Powerful group by functionality for performing split-apply-combine operations on data sets. 

**Why Pandas is used for Data Science** - 
Pandas is used in conjunction with other data science libraries. It is built on top of NumPy, which means that many NumPy structures are used or replicated in Pandas. The data produced by Pandas is often used as input for plotting functions in Matplotlib, statistical analysis in SciPy, and machine learning algorithms in Scikit-learn.
A Pandas program can be run from any text editor, but Jupyter Notebook is recommended because it can execute the code in a particular cell rather than the entire file. Jupyter also provides an easy way to visualize pandas data frames and plots.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

---
# 1. Introduction
---
Pandas provides two main data structures for manipulating data:
* Series 
* DataFrame 

## 1.1 Series
A Pandas Series is a one-dimensional labelled array capable of holding data of any type (integer, string, float, Python objects, etc.). The axis labels are collectively called the index. **A Pandas Series is comparable to a single column in an Excel sheet**. Labels need not be unique but must be of a hashable type. The object supports both integer and label-based indexing and provides a host of methods for performing operations involving the index.

![series](https://media.geeksforgeeks.org/wp-content/uploads/20200225170506/pandas-series.png)
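
As a small illustration of the point about labels, here is a minimal sketch. The variable names and values are invented for this example. It shows that a Series may carry a repeated label, in which case a label lookup returns every matching entry:

```python
import pandas as pd

# Hypothetical data: the label 'a' appears twice on purpose.
prices = pd.Series([10, 20, 30], index=['a', 'b', 'a'])

# A repeated label returns a Series holding every match.
both_a = prices.loc['a']

# A unique label returns a single scalar value.
only_b = prices.loc['b']

print(both_a.tolist())          # the values stored under 'a'
print(only_b)                   # the value stored under 'b'
print(prices.index.is_unique)   # reports whether any label repeats
```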

### 1.1.1 Creating a Pandas Series
In the real world, a Pandas Series is usually created by loading a dataset from existing storage, such as a SQL database, a CSV file, or an Excel file. A Series can also be created from a list, a dictionary, a scalar value, and so on. Here are some of the ways to create one:

**Creating a series from array** : To create a series from an array, import the numpy module and build the array with its `array()` function.

In [2]:
arr = np.array(['Volt','Excalibur','Gauss','Mag','Xaku','Revenant','Rhino','Garuda','Loki','Nidus'])

In [3]:
S = pd.Series(arr)
S

0         Volt
1    Excalibur
2        Gauss
3          Mag
4         Xaku
5     Revenant
6        Rhino
7       Garuda
8         Loki
9        Nidus
dtype: object

**Creating series from list** :  To create a series from a list, first create the list and then pass it to `pd.Series()`.

In [4]:
lst = list(arr)

In [5]:
L = pd.Series(lst)
L

0         Volt
1    Excalibur
2        Gauss
3          Mag
4         Xaku
5     Revenant
6        Rhino
7       Garuda
8         Loki
9        Nidus
dtype: object

**Creating Series From Dictionary** : To create a series from a dictionary, first create the dictionary and then pass it to `pd.Series()`. The dictionary keys are used to construct the index.

In [6]:
dct = {1:'Nidus', 2:'Excalibur Umbra', 3:'Volt Prime', 4:'Gauss Prime', 5: 'Atlas Prime'}
pd.Series(dct)

1              Nidus
2    Excalibur Umbra
3         Volt Prime
4        Gauss Prime
5        Atlas Prime
dtype: object
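
When a dictionary is combined with an explicit `index`, pandas pulls out the values for the requested keys in the requested order. The sketch below uses a shortened, hypothetical version of the dictionary above and asks for one key (`9`) that does not exist, which comes back as a missing value:

```python
import pandas as pd

# Shortened copy of the dictionary from the example above.
frames = {1: 'Nidus', 2: 'Excalibur Umbra', 3: 'Volt Prime'}

# Only the listed keys are used, in the listed order; key 9 is absent.
picked = pd.Series(frames, index=[3, 1, 9])

print(picked)
```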

**Creating a series from Scalar value** : In order to create a series from scalar value, an index must be provided. The scalar value will be repeated to match the length of index.

In [7]:
pd.Series(10, index = np.asarray(np.arange(0,6,1)))

0    10
1    10
2    10
3    10
4    10
5    10
dtype: int64

### 1.1.2 Accessing Elements
There are two ways to access the elements of a series:
* Accessing Element from Series with Position
* Accessing Element Using Label (index)

**Accessing Elements via Position** :  To access a series element by position, pass its index number, which must be an integer, to the index operator `[ ]`. To access multiple elements at once, use a slice.

In [8]:
# Accessing 1st five elements
S[:5]

0         Volt
1    Excalibur
2        Gauss
3          Mag
4         Xaku
dtype: object

In [9]:
# Accessing elements from position 4 to position 7
S[4:8]

4        Xaku
5    Revenant
6       Rhino
7      Garuda
dtype: object

**Accessing Elements via Labels** :  To access an element by label, pass its index label to the index operator. A Series is like a fixed-size dictionary in that you can get and set values by index label.

In [10]:
data = np.array(['g','e','e','k','s','f', 'o','r','g','e','e','k','s'])
ser = pd.Series(data, index = np.asarray(np.arange(10,23,1)))

In [11]:
ser[15]

'f'
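
The dictionary analogy can be sketched as follows. This is a shortened, illustrative version of `ser`, not the full series above. It sets a value by label, tests whether a label exists, and reads a label with a fallback:

```python
import pandas as pd

# Shortened, illustrative version of the series above.
ser = pd.Series(['g', 'e', 'e', 'k', 's'], index=[10, 11, 12, 13, 14])

# Set a value by its label, as with a dictionary key.
ser.loc[12] = 'E'

# Membership tests look at the labels, not at the values.
has_11 = 11 in ser

# .get() returns a default instead of raising when the label is absent.
missing = ser.get(99, 'n/a')

print(ser.loc[12], has_11, missing)
```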

### 1.1.3 Indexing and Selecting Data
Indexing in pandas simply means selecting particular data from a Series. It could mean selecting all of the data or only a subset of it. Indexing is also known as subset selection. It can be done in 3 ways:
* Index Operator - `[ ]`
* Using `.loc`
* Using `.iloc`

**Index Operator** : 
The indexing operator refers to the square brackets that follow an object, written here as `df[ ]`.

In [12]:
S[4:9]                                         #from index 4 to index 8

4        Xaku
5    Revenant
6       Rhino
7      Garuda
8        Loki
dtype: object

**`df.loc[ ]` Method** : This indexer selects data by referring to the explicit index, that is, by label. The `df.loc[ ]` indexer selects data in a different way than the plain indexing operator and can select subsets of data. *It does not exclude the last index of a slice.*

In [13]:
S.loc[4:9]                                   #from index 4 to index 9

4        Xaku
5    Revenant
6       Rhino
7      Garuda
8        Loki
9       Nidus
dtype: object

**`df.iloc[ ]` Method** : This indexer retrieves data by position, so we need to specify the positions of the data that we want. The `df.iloc` indexer is very similar to `df.loc` but uses only integer locations to make its selections. *Like the index operator, it excludes the last index of a slice.*

In [14]:
S.iloc[4:9]                               #from index 4 to index 8

4        Xaku
5    Revenant
6       Rhino
7      Garuda
8        Loki
dtype: object

The main distinction between `loc` and `iloc` is:
* `loc` is label-based, which means that you have to specify rows and columns based on their row and column labels.
* `iloc` is integer position-based, so you have to specify rows and columns by their integer position values (0-based integer position).

![difference](https://miro.medium.com/max/2000/1*CgAWzayEQY8PQuMpRkSGfQ.png)
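
The difference is easiest to see when the labels are integers that do not start at 0. The following sketch uses made-up data and slices the same series both ways:

```python
import pandas as pd

# Illustrative series whose labels (10..14) differ from its positions (0..4).
letters = pd.Series(['a', 'b', 'c', 'd', 'e'], index=[10, 11, 12, 13, 14])

# Label-based: selects labels 11, 12 and 13 (the end label is included).
by_label = letters.loc[11:13]

# Position-based: selects positions 1 and 2 (the end position is excluded).
by_position = letters.iloc[1:3]

print(by_label.tolist())
print(by_position.tolist())
```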

### 1.1.4 Binary Operations on Series
We can perform binary operations on series, such as addition, subtraction and many others, using methods like `.add()`, `.sub()`, `.mul()`, etc. The list below also includes a few related reductions and element-wise helpers.

* `add()` - Adds a series, or a list-like object of the same length, to the caller series
* `sub()` - Subtracts a series, or a list-like object of the same length, from the caller series
* `mul()` - Multiplies the caller series by a series or a list-like object of the same length
* `div()` - Divides the caller series by a series or a list-like object of the same length
* `sum()` - Returns the sum of the values for the requested axis
* `prod()`- Returns the product of the values for the requested axis
* `mean()`- Returns the mean of the values for the requested axis
* `pow()` - Raises each element of the caller series to the power given by the matching element of the passed series
* `abs()` - Returns the absolute numeric value of each element in a Series/DataFrame
* `cov()` - Computes the covariance of two series

For indices that appear in only one of the two series, pass the `fill_value` argument; it stands in for the value missing on the other side.

In [15]:
d1 = pd.Series([5, 2, 3,7], index=['a', 'b', 'c', 'd'])
d2 = pd.Series([1, 6, 4, 9], index=['a', 'b', 'd', 'e'])

In [16]:
d1.add(d2, fill_value = 0)

a     6.0
b     8.0
c     3.0
d    11.0
e     9.0
dtype: float64

In [17]:
d1.sub(d2, fill_value = 0)

a    4.0
b   -4.0
c    3.0
d    3.0
e   -9.0
dtype: float64
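
The same pattern applies to the other methods. As a rough sketch with invented values, the block below shows that `div()` and `pow()` treat the caller as the left-hand operand, and contrasts `fill_value` with the plain `/` operator, which leaves unmatched labels as NaN:

```python
import pandas as pd

# Illustrative data: label 'c' exists only in d1.
d1 = pd.Series([6, 8, 10], index=['a', 'b', 'c'])
d2 = pd.Series([3, 2], index=['a', 'b'])

# Caller divided by the argument; the missing 'c' in d2 is treated as 1.
quotient = d1.div(d2, fill_value=1)

# Caller raised to the power of the argument, again filling 'c' with 1.
power = d1.pow(d2, fill_value=1)

# The plain operator has no fill value, so 'c' becomes NaN.
plain = d1 / d2

print(quotient.tolist())
print(power.tolist())
print(plain.tolist())
```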

### 1.1.4 Conversion Operation on Series
In conversion operations we change the form of a series, for example changing its datatype or turning it into a list. Methods such as `.astype()` and `.tolist()` help with this. Converting a Series into a numpy array is achieved with `Series.to_numpy()`; the older `Series.as_matrix()` was deprecated and has been removed from current pandas releases.

In [18]:
iris = sns.load_dataset('iris')

In [19]:
iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [20]:
iris['sepal_width'].astype(int)

0      3
1      3
2      3
3      3
4      3
      ..
145    3
146    2
147    3
148    3
149    3
Name: sepal_width, Length: 150, dtype: int32

In [21]:
iris['petal_length'].tolist()[:10]     #showing only first 10 conversions

[1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5]

In [22]:
iris['petal_length'].to_numpy()[:10]

array([1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5])
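
One detail worth keeping in mind is that `astype(int)` truncates the fractional part rather than rounding. A minimal sketch with three invented measurements (no dataset download needed):

```python
import pandas as pd

# Three illustrative measurements standing in for a real column.
widths = pd.Series([3.5, 2.9, 3.0], name='sepal_width')

# astype(int) drops the fractional part; it does not round.
as_int = widths.astype(int)

# tolist() gives a plain Python list, to_numpy() a NumPy array.
as_list = widths.tolist()
as_array = widths.to_numpy()

print(as_int.tolist())
print(as_list)
print(type(as_array).__name__)
```

If rounding is what you want, calling `widths.round()` before the conversion is one option.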

### 1.1.5 Miscellaneous Pandas Series Operations

In [23]:
iris['sepal_length'].count()      #Returns number of non-NA/null observations in the Series

150

In [24]:
iris.size                         #Returns the number of elements in the underlying data

750

In [25]:
S.name = "Warframes"              #The name attribute gives a name to a Series object, i.e. to the column

In [26]:
iris['sepal_length'].is_unique    #Attribute is True if all values in the object are unique

False

In [27]:
iris['species'].unique()        #to see the unique values in a particular column

array(['setosa', 'versicolor', 'virginica'], dtype=object)

In [28]:
iris['species'].nunique()  #to see the no. of unique values in a particular column

3

In [29]:
iris['sepal_length'].idxmin()  #To see the idx with minimum value in a series, for max use Series.idxmax()

13

* **Compare Series** - `le()`, `ge()`, `lt()`, `gt()`, `eq()`, `ne()`

In [30]:
a = pd.Series([1, 1, 1, np.nan], index = ['a', 'b', 'c', 'd'])
b = pd.Series([2, np.nan, 0.5, np.nan], index = ['a', 'b', 'd', 'e'])

a.le(b, fill_value = 0) #Used to compare every element of Caller series (a here) less than or equal to passed series (b here)

a     True
b    False
c    False
d     True
e    False
dtype: bool

In [31]:
a = pd.Series([1, 1, 1, np.nan], index = ['a', 'b', 'c', 'd'])
b = pd.Series([1, np.nan, 1, np.nan], index = ['a', 'b', 'd', 'e'])

a.eq(b, fill_value = 0) #returns True for every element in Caller Series which is Equal to the element in passed series

a     True
b    False
c    False
d    False
e    False
dtype: bool

In [32]:
a.eq(b, fill_value = 0) #Return Equal to of series and other, element-wise (binary operator eq).

a     True
b    False
c    False
d    False
e    False
dtype: bool

* **Check for Empty Dataframe**

In [33]:
iris = sns.load_dataset('iris') 
iris.empty

False

* **Check for NaN's**

`Series.hasnans` - Returns True if the Series contains any NaNs; enables various performance speedups.

In [34]:
b.hasnans 

True

* **Drop NaN's**

`Series.dropna(axis=0, inplace=False, how=None)` - Return a new Series with missing values removed.

Parameters
1. `axis` {*0 or ‘index’*}, default 0 - There is only one axis to drop values from.

2. `inplace` bool, default False - If True, do operation inplace and return None.

3. `how` str, optional - Not in use. Kept for compatibility.

In [35]:
ser = pd.Series([1., 2., 3., 4., np.nan])
ser.dropna()

0    1.0
1    2.0
2    3.0
3    4.0
dtype: float64

* **Fill NaN's**

`Series.fillna(value, method, axis, inplace)` - Fill NA/NaN values using the specified method.

1. `value` scalar, dict, Series, or DataFrame - Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). Values not in the dict/Series/DataFrame will not be filled. This value cannot be a list.

2. `method` {‘*backfill*’, ‘*bfill*’, ‘*pad*’, ‘*ffill*’, *None*}, default None - Method to use for filling holes in a reindexed Series. `pad` / `ffill`: propagate the last valid observation forward to the next valid one. `backfill` / `bfill`: use the next valid observation to fill the gap. Recent pandas releases deprecate this argument in favour of the `ffill()` and `bfill()` methods.

3. `axis` {0 or ‘index’} - Axis along which to fill missing values.

In [36]:
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list("ABCD"))
df

Unnamed: 0,A,B,C,D
0,,2.0,,0
1,3.0,4.0,,1
2,,,,5
3,,3.0,,4


In [37]:
df.fillna(0) #fill all NaN with 0

Unnamed: 0,A,B,C,D
0,0.0,2.0,0.0,0
1,3.0,4.0,0.0,1
2,0.0,0.0,0.0,5
3,0.0,3.0,0.0,4


In [38]:
df.fillna(value = {"A": 0, "B": 1, "C": 2, "D": 3}) #Replace all NaN elements with specific values

Unnamed: 0,A,B,C,D
0,0.0,2.0,2.0,0
1,3.0,4.0,2.0,1
2,0.0,1.0,2.0,5
3,0.0,3.0,2.0,4
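
The forward and backward strategies described for the `method` parameter are also available as the dedicated `ffill()` and `bfill()` methods. The sketch below applies them to a small invented Series so the propagation direction is easy to follow:

```python
import numpy as np
import pandas as pd

# Illustrative readings with gaps at the start and in the middle.
readings = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])

# Forward fill: each gap takes the last valid value before it.
# The leading NaN has nothing before it, so it stays NaN.
forward = readings.ffill()

# Backward fill: each gap takes the next valid value after it.
backward = readings.bfill()

# Constant fill, as with fillna(0) above.
constant = readings.fillna(0)

print(forward.tolist())
print(backward.tolist())
print(constant.tolist())
```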


* **Combine 2 series**

`s1.combine(s2, func, fill_value)` - Combine the Series with another Series or a scalar, using `func` to choose the result element by element. `fill_value` is the value assumed when an index is missing from one of the two objects being combined.

In [39]:
s1 = pd.Series({'falcon': 330.0, 'eagle': 160.0})
s2 = pd.Series({'falcon': 345.0, 'eagle': 200.0, 'duck': 30.0})
s1.combine(s2, max)

duck        NaN
eagle     200.0
falcon    345.0
dtype: float64
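
In the output above, `duck` is NaN because it is missing from `s1` and no `fill_value` was given. As a sketch of the alternative, passing `fill_value=0` lets `max` compare the existing value against 0 instead:

```python
import pandas as pd

s1 = pd.Series({'falcon': 330.0, 'eagle': 160.0})
s2 = pd.Series({'falcon': 345.0, 'eagle': 200.0, 'duck': 30.0})

# 'duck' is absent from s1, so it is treated as 0 before max() is applied.
fastest = s1.combine(s2, max, fill_value=0)

print(fastest)
```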

* **Removing Series Elements**

`Series.drop(labels, axis, index, columns)` - Remove elements of a Series based on specifying the index labels. When using a multi-index, labels on different levels can be removed by specifying the level.

`labels` - Index labels to drop.

`axis`, default 0 - Redundant for application on Series.

`index` - Redundant for application on Series, but ‘index’ can be used instead of ‘labels’.

`columns` - No change is made to the Series; use ‘index’ or ‘labels’ instead.

In [40]:
s = pd.Series(data=np.arange(5), index=['A', 'B', 'C', 'D', 'E'])
s.drop(labels=['B', 'C'])

A    0
D    3
E    4
dtype: int32
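
The multi-index case mentioned above can be sketched like this. The index and values are invented for the example; `level=1` tells `drop()` to look for the label in the second level of the index:

```python
import pandas as pd

# Illustrative two-level index: (animal, measurement).
index = pd.MultiIndex.from_product([['cow', 'falcon'], ['speed', 'weight']])
stats = pd.Series([30.0, 250.0, 320.0, 1.0], index=index)

# Remove every entry whose second-level label is 'weight'.
no_weight = stats.drop(labels='weight', level=1)

print(no_weight)
```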

* **Find Series Duplicates**

Indicate duplicate Series values.

`Series.duplicated(keep = 'first')` - Duplicated values are indicated as True values in the resulting Series. Either all duplicates, all except the first or all except the last occurrence of duplicates can be indicated.

Parameters - `keep` {‘first’, ‘last’, False}, default ‘first’. How to mark duplicates:

1. ‘first’ : Mark duplicates as True except for the first occurrence.

2. ‘last’ : Mark duplicates as True except for the last occurrence.

3. False : Mark all duplicates as True.

In [41]:
animals = pd.Series(['lama', 'cow', 'lama', 'beetle', 'lama'])
animals.duplicated()

0    False
1    False
2     True
3    False
4     True
dtype: bool

In [42]:
animals.duplicated(keep = False)

0     True
1    False
2     True
3    False
4     True
dtype: bool
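
The boolean result of `duplicated()` is typically used as a mask. A short sketch, reusing the `animals` series, filters with the inverted mask and compares that with `Series.drop_duplicates()`, which does the marking and filtering in one step:

```python
import pandas as pd

animals = pd.Series(['lama', 'cow', 'lama', 'beetle', 'lama'])

# Keep only the rows that are NOT marked as duplicates (first occurrence wins).
first_kept = animals[~animals.duplicated()]

# drop_duplicates() accepts the same keep argument; here the last occurrence wins.
last_kept = animals.drop_duplicates(keep='last')

print(first_kept.tolist())
print(last_kept.tolist())
```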

## 1.2 DataFrame
A Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns); in other words, data is aligned in a tabular fashion in rows and columns. A Pandas DataFrame consists of three principal components: the data, the rows, and the columns.
![dataframe](https://media.geeksforgeeks.org/wp-content/cdn-uploads/creating_dataframe1.png)

### 1.2.1 Creating Pandas DataFrame
In the real world, a Pandas DataFrame is usually created by loading a dataset from existing storage, such as a SQL database, a CSV file, or an Excel file. However, a DataFrame can also be created from a list, a dictionary, a list of dictionaries, and so on.

* **Empty Dataframe**

In [43]:
pd.DataFrame()

* **Dataframe from List**

In [44]:
np.asarray(lst).T #list

array(['Volt', 'Excalibur', 'Gauss', 'Mag', 'Xaku', 'Revenant', 'Rhino',
       'Garuda', 'Loki', 'Nidus'], dtype='<U9')

In [45]:
pd.DataFrame(lst) #Dataframe from list

Unnamed: 0,0
0,Volt
1,Excalibur
2,Gauss
3,Mag
4,Xaku
5,Revenant
6,Rhino
7,Garuda
8,Loki
9,Nidus


* **Dataframe from Dictionary**
To create a DataFrame from a dict of ndarrays/lists, all the arrays must be of the same length. If an index is passed, its length should equal the length of the arrays. If no index is passed, the index defaults to range(n), where n is the array length.

In [46]:
data = {'Name':['Tom', 'nick', 'krish', 'jack'], 'Age':[20, 21, 19, 18]}

In [47]:
pd.DataFrame(data)

Unnamed: 0,Name,Age
0,Tom,20
1,nick,21
2,krish,19
3,jack,18
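
Two variations on this are sketched below. The row labels (`r1` to `r4`) and the second record list are made up for the example. The first passes an explicit `index`, the second builds a DataFrame from a list of dictionaries, where a key missing from one record becomes NaN:

```python
import pandas as pd

data = {'Name': ['Tom', 'nick', 'krish', 'jack'], 'Age': [20, 21, 19, 18]}

# Explicit row labels; the index length must match the list length (4).
labelled = pd.DataFrame(data, index=['r1', 'r2', 'r3', 'r4'])

# List of dictionaries: each dictionary is one row.
# The second record has no 'Age', so that cell is filled with NaN.
records = [{'Name': 'Tom', 'Age': 20}, {'Name': 'nick'}]
from_records = pd.DataFrame(records)

print(labelled)
print(from_records)
```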


### 1.2.2 Rows and Columns Selection
* **Rows Selection**  The `DataFrame.loc[]` indexer retrieves rows from a Pandas DataFrame by label. Rows can also be selected by passing integer locations to the `iloc[]` indexer.

In [48]:
iris.loc[3]             # Row with index label 3 (the 4th record)

sepal_length       4.6
sepal_width        3.1
petal_length       1.5
petal_width        0.2
species         setosa
Name: 3, dtype: object

**OR**

In [49]:
iris.loc[:4]             # First 5 records

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [50]:
iris.loc[:4,['sepal_length','species']]  # First 5 records with specific columns

Unnamed: 0,sepal_length,species
0,5.1,setosa
1,4.9,setosa
2,4.7,setosa
3,4.6,setosa
4,5.0,setosa


In [51]:
iris.loc[[3,5,7]]  #Specific Records - note the double brackets

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
3,4.6,3.1,1.5,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
7,5.0,3.4,1.5,0.2,setosa


In [52]:
iris.sample(frac = 0.5).head() #Randomly select fraction of rows.

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
84,5.4,3.0,4.5,1.5,versicolor
52,6.9,3.1,4.9,1.5,versicolor
27,5.2,3.5,1.5,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
104,6.5,3.0,5.8,2.2,virginica


In [53]:
iris.sample(n = 34).head(3)  #Randomly select 'n' rows

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
125,7.2,3.2,6.0,1.8,virginica
119,6.0,2.2,5.0,1.5,virginica
124,6.7,3.3,5.7,2.1,virginica


* **Column Selection** To select a column in a Pandas DataFrame, we can access it by its column name.

In [54]:
iris['sepal_width'][:6] # First 6 entries in column named 'sepal_width'

0    3.5
1    3.0
2    3.2
3    3.1
4    3.6
5    3.9
Name: sepal_width, dtype: float64

In [55]:
iris[iris.columns[1:3]].head() #Select 2nd to 3rd column.

Unnamed: 0,sepal_width,petal_length
0,3.5,1.4
1,3.0,1.4
2,3.2,1.3
3,3.1,1.5
4,3.6,1.4


In [56]:
iris.loc[3:9,'sepal_width':'petal_width']  #Continuous Selection of columns

Unnamed: 0,sepal_width,petal_length,petal_width
3,3.1,1.5,0.2
4,3.6,1.4,0.2
5,3.9,1.7,0.4
6,3.4,1.4,0.3
7,3.4,1.5,0.2
8,2.9,1.4,0.2
9,3.1,1.5,0.1


* **Explicit Selection by Position**


In [57]:
iris.iloc[0:5, 1:3]

Unnamed: 0,sepal_width,petal_length
0,3.5,1.4
1,3.0,1.4
2,3.2,1.3
3,3.1,1.5
4,3.6,1.4


### 1.2.3 Handling Missing Data
Missing data occurs when no information is provided for one or more items or for a whole unit. Missing data is a very big problem in real-life scenarios. In pandas, missing data is also referred to as NA (Not Available) values.

* **Checking for missing values using `isnull()` and `notnull()`** : To check for missing values in a Pandas DataFrame, we use the functions `isnull()` and `notnull()`. Both help in checking whether a value is NaN or not. These functions can also be used on a Pandas Series to find null values in it.

In [58]:
df

Unnamed: 0,A,B,C,D
0,,2.0,,0
1,3.0,4.0,,1
2,,,,5
3,,3.0,,4


In [59]:
df.isnull()

Unnamed: 0,A,B,C,D
0,True,False,True,False
1,False,False,True,False
2,True,True,True,False
3,True,False,True,False


* **Fill missing values using `fillna()`,`replace()`,`interpolate()`** : These functions replace NaN values with some other value, which helps fill the null values in a DataFrame. `interpolate()` also fills NA values, but it uses various interpolation techniques to compute the missing values rather than a hard-coded value.

In [60]:
df.fillna(0.0)

Unnamed: 0,A,B,C,D
0,0.0,2.0,0.0,0
1,3.0,4.0,0.0,1
2,0.0,0.0,0.0,5
3,0.0,3.0,0.0,4


In [61]:
df.interpolate()

Unnamed: 0,A,B,C,D
0,,2.0,,0
1,3.0,4.0,,1
2,3.0,3.5,,5
3,3.0,3.0,,4


As the output shows, values in the first row could not be filled: the filling direction is forward and there is no previous value to use in the interpolation. Column C stays empty because it contains no values at all.

In [62]:
df.fillna(method = 'pad') #Filling null values with the previous ones; newer pandas spells this df.ffill()

Unnamed: 0,A,B,C,D
0,,2.0,,0
1,3.0,4.0,,1
2,3.0,4.0,,5
3,3.0,3.0,,4


In [63]:
df.fillna(method ='bfill') #Filling null values with the next ones; newer pandas spells this df.bfill()

Unnamed: 0,A,B,C,D
0,3.0,2.0,,0
1,3.0,4.0,,1
2,,3.0,,5
3,,3.0,,4
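
`replace()` is listed above but not demonstrated. As a hedged sketch on a smaller, invented frame, it can swap NaN for a sentinel value in the same way `fillna()` does, while also being usable for non-missing values:

```python
import numpy as np
import pandas as pd

# Smaller illustrative frame with a few gaps.
df = pd.DataFrame({'A': [np.nan, 3.0, np.nan], 'B': [2.0, 4.0, np.nan]})

# Replace every NaN with -1; the original frame is left unchanged.
replaced = df.replace(np.nan, -1)

print(replaced)
```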


* **Dropping missing values using `dropna()`** : To drop null values from a dataframe, we use the `dropna()` function, which drops rows or columns containing null values in different ways.

In [64]:
# Named `scores` so that the built-in `dict` is not shadowed
scores = {'First Score':[100, 90, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score':[52, 40, 80, 98],
          'Fourth Score':[np.nan, np.nan, np.nan, 65],
          'Fifth Score':[12, 78, 45, 90]}

In [65]:
DT = pd.DataFrame(scores)
DT

Unnamed: 0,First Score,Second Score,Third Score,Fourth Score,Fifth Score
0,100.0,30.0,52,,12
1,90.0,,40,,78
2,,45.0,80,,45
3,95.0,56.0,98,65.0,90


In [66]:
dict_2 = {'First Score':[100, np.nan, np.nan, 95],
        'Second Score': [30, np.nan, 45, 56],
        'Third Score':[52, np.nan, 80, 98],
        'Fourth Score':[np.nan, np.nan, np.nan, 65]}
DT_2 = pd.DataFrame(dict_2)
DT_2

Unnamed: 0,First Score,Second Score,Third Score,Fourth Score
0,100.0,30.0,52.0,
1,,,,
2,,45.0,80.0,
3,95.0,56.0,98.0,65.0


In [67]:
DT_2.dropna(axis = 0, how = 'all') # drop rows in which all values are missing (NaN)

Unnamed: 0,First Score,Second Score,Third Score,Fourth Score
0,100.0,30.0,52.0,
2,,45.0,80.0,
3,95.0,56.0,98.0,65.0


In [68]:
DT.dropna()                       #Dropping rows with at least 1 NaN

Unnamed: 0,First Score,Second Score,Third Score,Fourth Score,Fifth Score
3,95.0,56.0,98,65.0,90


In [69]:
DT.dropna(axis = 1)               #Dropping columns with at least 1 NaN

Unnamed: 0,Third Score,Fifth Score
0,52,12
1,40,78
2,80,45
3,98,90
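
Between "any NaN" and "all NaN" there are two further options, `thresh` and `subset`. The sketch below uses a reduced, illustrative version of the score table. It keeps rows with at least two known values, and then rows where one particular column is known:

```python
import numpy as np
import pandas as pd

# Reduced version of the score table above.
scores = pd.DataFrame({
    'First Score': [100, 90, np.nan, 95],
    'Second Score': [30, np.nan, 45, 56],
    'Fourth Score': [np.nan, np.nan, np.nan, 65],
})

# Keep rows that have at least 2 non-missing values.
at_least_two = scores.dropna(thresh=2)

# Only look at 'First Score' when deciding which rows to drop.
first_known = scores.dropna(subset=['First Score'])

print(at_least_two.index.tolist())
print(first_known.index.tolist())
```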


### 1.2.4 Iterating over Dataframe
A Pandas DataFrame consists of rows and columns, so we can iterate over a dataframe much like a dictionary.
* **Iterating Over Rows** : To iterate over rows, we can use `iterrows()` or `itertuples()`. The related `items()` method (called `iteritems()` in older pandas versions, and since removed) iterates over columns instead.

In [70]:
for i,j in iris.iterrows():
    print(i,j)

0 sepal_length       5.1
sepal_width        3.5
petal_length       1.4
petal_width        0.2
species         setosa
Name: 0, dtype: object
1 sepal_length       4.9
sepal_width        3.0
petal_length       1.4
petal_width        0.2
species         setosa
Name: 1, dtype: object
2 sepal_length       4.7
sepal_width        3.2
petal_length       1.3
petal_width        0.2
species         setosa
Name: 2, dtype: object
3 sepal_length       4.6
sepal_width        3.1
petal_length       1.5
petal_width        0.2
species         setosa
Name: 3, dtype: object
4 sepal_length       5.0
sepal_width        3.6
petal_length       1.4
petal_width        0.2
species         setosa
Name: 4, dtype: object
5 sepal_length       5.4
sepal_width        3.9
petal_length       1.7
petal_width        0.4
species         setosa
Name: 5, dtype: object
6 sepal_length       4.6
sepal_width        3.4
petal_length       1.4
petal_width        0.3
species         setosa
Name: 6, dtype: object
...

89 sepal_length           5.5
sepal_width            2.5
petal_length           4.0
petal_width            1.3
species         versicolor
Name: 89, dtype: object
90 sepal_length           5.5
sepal_width            2.6
petal_length           4.4
petal_width            1.2
species         versicolor
Name: 90, dtype: object
91 sepal_length           6.1
sepal_width            3.0
petal_length           4.6
petal_width            1.4
species         versicolor
Name: 91, dtype: object
92 sepal_length           5.8
sepal_width            2.6
petal_length           4.0
petal_width            1.2
species         versicolor
Name: 92, dtype: object
93 sepal_length           5.0
sepal_width            2.3
petal_length           3.3
petal_width            1.0
species         versicolor
Name: 93, dtype: object
94 sepal_length           5.6
sepal_width            2.7
petal_length           4.2
petal_width            1.3
species         versicolor
Name: 94, dtype: object
...
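
`itertuples()` is the other row iterator. A minimal sketch on a two-row stand-in for the iris data (so that no download is needed) shows that each row arrives as a named tuple, with the index available as `row.Index`:

```python
import pandas as pd

# Two-row stand-in for the iris DataFrame.
flowers = pd.DataFrame({'sepal_length': [5.1, 4.9], 'species': ['setosa', 'setosa']})

rows = []
for row in flowers.itertuples(index=True):
    # Fields are accessed as attributes named after the columns.
    rows.append((row.Index, row.sepal_length, row.species))

print(rows)
```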

* **Iterating over Columns** : To iterate over columns, we create a list of the dataframe's column names and then iterate through that list to pull out each column.

In [71]:
list(iris)

['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

In [72]:
for i in list(iris):
    print(iris[i][2])    #print details of 3rd record

4.7
3.2
1.3
0.2
setosa


**which is the same as**

In [73]:
iris.loc[2]

sepal_length       4.7
sepal_width        3.2
petal_length       1.3
petal_width        0.2
species         setosa
Name: 2, dtype: object
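
An alternative to looping over the column names is `DataFrame.items()`, which yields each column name together with its Series. The sketch below uses a small stand-in frame rather than the full iris data and collects the third record that way:

```python
import pandas as pd

# Three-row stand-in for the iris DataFrame.
flowers = pd.DataFrame({'sepal_length': [5.1, 4.9, 4.7], 'species': ['setosa'] * 3})

third_record = {}
for name, column in flowers.items():
    # column is a Series; iloc[2] picks its third value by position.
    third_record[name] = column.iloc[2]

print(third_record)
```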

### 1.2.5 Miscellaneous Dataframe Operations

* **First 5 Entries**

In [74]:
iris.head() 

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


* **Last 5 entries**

In [75]:
iris.tail()  

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica
149,5.9,3.0,5.1,1.8,virginica


* **Descriptive statistics.**

In [76]:
iris.describe()  

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
count,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333
std,0.828066,0.435866,1.765298,0.762238
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


In [77]:
iris.index  #The index (row labels) of the DataFrame.

RangeIndex(start=0, stop=150, step=1)

In [78]:
iris.columns #The column labels of the DataFrame.

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')

**OR**

In [79]:
list(iris)

['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

Use `sorted(df)` to get column names in alphabetical order.

* **No. of Columns**

In [80]:
len(iris.columns)  

5

* **No. of Rows**

In [81]:
len(iris)  #no. of rows in dataframe

150

* **Concise summary of a DataFrame.**

In [82]:
iris.info()  

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


* **Datatypes of DataFrame**

In [83]:
iris.dtypes  

sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
species          object
dtype: object

* **Numpy Representation of DataFrame**

In [84]:
iris.values[:5]  

array([[5.1, 3.5, 1.4, 0.2, 'setosa'],
       [4.9, 3.0, 1.4, 0.2, 'setosa'],
       [4.7, 3.2, 1.3, 0.2, 'setosa'],
       [4.6, 3.1, 1.5, 0.2, 'setosa'],
       [5.0, 3.6, 1.4, 0.2, 'setosa']], dtype=object)

In [85]:
type(iris.values[:5])

numpy.ndarray

In [86]:
iris.ndim #Return an int representing the number of axes / array dimensions.

2

In [87]:
iris.size  #Return an int representing the number of elements in this object.

750

In [88]:
iris.memory_usage()   #Return the memory usage of each column in bytes.

Index            128
sepal_length    1200
sepal_width     1200
petal_length    1200
petal_width     1200
species         1200
dtype: int64

* **Convert to NumPy Array**

A DataFrame can be converted to a NumPy ndarray with the `DataFrame.to_numpy()` method.

In [89]:
iris.to_numpy()[:5]

array([[5.1, 3.5, 1.4, 0.2, 'setosa'],
       [4.9, 3.0, 1.4, 0.2, 'setosa'],
       [4.7, 3.2, 1.3, 0.2, 'setosa'],
       [4.6, 3.1, 1.5, 0.2, 'setosa'],
       [5.0, 3.6, 1.4, 0.2, 'setosa']], dtype=object)

* **Accessing Specific Value**

Pandas `at[]` returns the single value stored at the passed location, given in the format [row label, column name]. It works in a similar way to Pandas `loc[ ]`, but because `at[ ]` returns only a single value it is faster.

*Note*

1. Unlike `DataFrame.loc[ ]`, this accessor returns only a single value. Hence `DataFrame.at[3:6, label]` will raise an error.
2. Since it works only for single values, it is faster than `DataFrame.loc[]`.

In [90]:
iris.at[3,'species']

'setosa'
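
Because `at[]` takes labels, it behaves the same way with a non-numeric index. The sketch below uses an invented two-row frame and also shows `iat[]`, the position-based counterpart in pandas (not covered above), and that `at[]` can set a value as well as read one:

```python
import pandas as pd

# Illustrative frame with string row labels.
flowers = pd.DataFrame(
    {'sepal_length': [5.1, 4.9], 'species': ['setosa', 'virginica']},
    index=['first', 'second'],
)

# Label-based single-value access: [row label, column name].
by_label = flowers.at['second', 'species']

# Position-based single-value access: [row position, column position].
by_position = flowers.iat[1, 1]

# at[] can also assign a single value.
flowers.at['first', 'sepal_length'] = 5.0

print(by_label, by_position, flowers.loc['first', 'sepal_length'])
```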