# Pandas

- Introduction to Pandas
- Reading the Data
- Functionalities of Pandas: Creation, Viewing, Editing
- Manipulating Data
- Handling NaN
- Handling Duplicates: Row Index, Column Names
- Handling String Data

Pandas is an open source library which provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Pandas has a lot of functions that will help in reading and writing data and also for data manipulation. Thus we will be using pandas throughout the course.

A pandas DataFrame behaves much like a sheet in an Excel file: the data is arranged in labelled rows and columns.

In [48]:
#Import Pandas
import pandas as pd

# Reading Data
We will use the `read_csv()` function, which reads a comma-separated values (CSV) file into a DataFrame.

In [49]:
#Loading data with read_csv() function. Here we are providing path to the csv file.

#If the file is in your system you can provide its path as well.
iris = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data")

type(iris)

pandas.core.frame.DataFrame

# Pandas Dataframes
A DataFrame is an object for data manipulation. You can think of it as a 2D tabular structure, where every row is a dataset entry and every column represents a feature of the data.

In [50]:
iris

Unnamed: 0,5.1,3.5,1.4,0.2,Iris-setosa
0,4.9,3.0,1.4,0.2,Iris-setosa
1,4.7,3.2,1.3,0.2,Iris-setosa
2,4.6,3.1,1.5,0.2,Iris-setosa
3,5.0,3.6,1.4,0.2,Iris-setosa
4,5.4,3.9,1.7,0.4,Iris-setosa
...,...,...,...,...,...
144,6.7,3.0,5.2,2.3,Iris-virginica
145,6.3,2.5,5.0,1.9,Iris-virginica
146,6.5,3.0,5.2,2.0,Iris-virginica
147,6.2,3.4,5.4,2.3,Iris-virginica


By default, the first row of the CSV file is used as the column names, which is why the first record (5.1, 3.5, 1.4, 0.2, Iris-setosa) appears as the header above.

# Creating copy of DataFrame

In [57]:
df = iris 

# Above statement simply makes df refer to the data frame object that iris is referring to.
# So now both iris and df refer to the same dataframe object and any changes done via one will 
# reflect in other. So effectively this is not creating another dataframe object.   

If we want an independent copy instead, we use the copy() function:

In [58]:
df = iris.copy()
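The difference between plain assignment and `copy()` can be checked on a tiny DataFrame. This is a minimal sketch; the names `original`, `alias` and `clone` are made up for illustration and are not part of the notebook.

```python
import pandas as pd

original = pd.DataFrame({"x": [1, 2, 3]})

alias = original          # a second name for the same object
clone = original.copy()   # an independent copy of the data

# Change a value through the original name
original.loc[0, "x"] = 99

print(alias is original)    # the alias is the very same object
print(alias.loc[0, "x"])    # the change is visible through the alias
print(clone.loc[0, "x"])    # the copy still holds the old value
```

Here the alias sees the new value 99, while the copy keeps 1.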

In [59]:
df.shape

(149, 5)

As you can see, we have 149 rows and 5 columns. The Iris dataset, however, describes 3 types of flower with 50 samples each, so there should be 150 rows. One row is missing because `read_csv()` used the first record as the column names. To fix this, we do the following:

In [60]:
# Ignoring header: If you don't want first row to be treated as a header, 
# we can set header = None
iris = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
                   header=None)
iris

Unnamed: 0,0,1,2,3,4
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


In [61]:
df = iris.copy()
df.shape

(150, 5)
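The effect of the header setting can be reproduced without a network connection by reading a small CSV string. This is only a sketch: the three-line `csv_text` is invented sample data, and `names=` is shown as another option that supplies column names while reading.

```python
import io
import pandas as pd

# Three records, no header line (invented sample in the style of iris.data)
csv_text = "5.1,3.5,Iris-setosa\n4.9,3.0,Iris-setosa\n6.3,3.3,Iris-virginica\n"

# Default: the first record is consumed as the header
with_default = pd.read_csv(io.StringIO(csv_text))

# header=None: every record is kept, columns are numbered 0, 1, 2
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

# names=...: every record is kept and the columns get the given names
named = pd.read_csv(io.StringIO(csv_text), names=["sl", "sw", "flower_type"])

print(with_default.shape)      # one row fewer than the file has records
print(no_header.shape)
print(list(named.columns))
```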

To see the datatypes of each column we do the following:

In [62]:
df.dtypes

0    float64
1    float64
2    float64
3    float64
4     object
dtype: object

Currently, our columns have no names.

In [63]:
df.columns

Index([0, 1, 2, 3, 4], dtype='int64')

To name them, we assign a list of names to `df.columns`:

In [64]:
df.columns = ['sl', 'sw', 'pl', 'pw', 'flower_type']
df

Unnamed: 0,sl,sw,pl,pw,flower_type
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


In [65]:
df.dtypes

sl             float64
sw             float64
pl             float64
pw             float64
flower_type     object
dtype: object

In [66]:
df.describe()

Unnamed: 0,sl,sw,pl,pw
count,150.0,150.0,150.0,150.0
mean,5.843333,3.054,3.758667,1.198667
std,0.828066,0.433594,1.76442,0.763161
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


# Some Basic Functionalities
## Viewing the DataFrame
The `head()` and `tail()` functions let us view part of the DataFrame.

### head()
This function returns the first n rows for the object based on position. It is useful for quickly testing if your object has the right type of data in it.

By default, value of n = 5.

In [67]:
df.head()

Unnamed: 0,sl,sw,pl,pw,flower_type
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [68]:
df.head(9)

Unnamed: 0,sl,sw,pl,pw,flower_type
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
5,5.4,3.9,1.7,0.4,Iris-setosa
6,4.6,3.4,1.4,0.3,Iris-setosa
7,5.0,3.4,1.5,0.2,Iris-setosa
8,4.4,2.9,1.4,0.2,Iris-setosa


### tail()
This function returns the last n rows of the object based on position. It is useful for quickly checking the end of the data, for example after sorting or appending rows.

By default, value of n = 5.

In [69]:
df.tail()

Unnamed: 0,sl,sw,pl,pw,flower_type
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica
149,5.9,3.0,5.1,1.8,Iris-virginica


In [70]:
df.tail(9)

Unnamed: 0,sl,sw,pl,pw,flower_type
141,6.9,3.1,5.1,2.3,Iris-virginica
142,5.8,2.7,5.1,1.9,Iris-virginica
143,6.8,3.2,5.9,2.3,Iris-virginica
144,6.7,3.3,5.7,2.5,Iris-virginica
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica
149,5.9,3.0,5.1,1.8,Iris-virginica


## Accessing Data
Sometimes, we may want to look at a single column from the DataFrame. This can be done simply as:

In [71]:
## Viewing sl column
df.sl

0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ... 
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: sl, Length: 150, dtype: float64

In [72]:
df['sl']

0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ... 
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: sl, Length: 150, dtype: float64
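Both forms return the same Series, but attribute access only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute. The sketch below uses an invented frame with two such awkward column names to show why the bracket form is the safer habit.

```python
import pandas as pd

# Invented example: one ordinary name, one with a space, one that
# clashes with the DataFrame.count() method
frame = pd.DataFrame({
    "sl": [5.1, 4.9],
    "petal length": [1.4, 1.4],
    "count": [10, 20],
})

# For an ordinary name both forms agree
print(frame.sl.equals(frame["sl"]))

# A name with a space can only be reached with brackets
print(frame["petal length"].tolist())

# frame.count is the method, not the column; brackets give the column
print(callable(frame.count))
print(frame["count"].tolist())
```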

## Checking for NULL values

In [73]:
df.isnull()

Unnamed: 0,sl,sw,pl,pw,flower_type
0,False,False,False,False,False
1,False,False,False,False,False
2,False,False,False,False,False
3,False,False,False,False,False
4,False,False,False,False,False
...,...,...,...,...,...
145,False,False,False,False,False
146,False,False,False,False,False
147,False,False,False,False,False
148,False,False,False,False,False


In [74]:
# Count the missing values in each column for a quick overview
df.isnull().sum()

sl             0
sw             0
pl             0
pw             0
flower_type    0
dtype: int64

## Selection
### iloc[]
We can use the `iloc[]` indexer to access values in a DataFrame. Note that `iloc` is used with square brackets; it is not called like a function.

It is purely integer-location based indexing for selection by position (from 0 to length-1 of the axis), but it may also be used with a boolean array.

Allowed inputs are:

- An integer, e.g. 5.
- A list or array of integers, e.g. [4, 3, 0].
- A slice object with ints, e.g. 1:7.
- A boolean array.

In [75]:
df.iloc[1:4, 2:4]

Unnamed: 0,pl,pw
1,1.4,0.2
2,1.3,0.2
3,1.5,0.2
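Each of the allowed inputs listed above can be tried on a small frame. The following is a minimal sketch with invented values; the index labels 10 to 40 are chosen on purpose so that positions and labels cannot be confused.

```python
import pandas as pd

frame = pd.DataFrame(
    {"sl": [5.1, 4.9, 4.7, 4.6], "sw": [3.5, 3.0, 3.2, 3.1]},
    index=[10, 20, 30, 40],
)

single_row = frame.iloc[2]                        # an integer: third row as a Series
picked = frame.iloc[[3, 0]]                       # a list: rows at positions 3 and 0
sliced = frame.iloc[1:3, 0:1]                     # slices: end positions are excluded
masked = frame.iloc[[True, False, True, False]]   # a boolean array

print(single_row["sl"])
print(list(picked.index))
print(sliced.shape)
print(list(masked.index))
```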


### loc[]
This accesses a group of rows and columns by label(s) or a boolean array.

`.loc[ ]` is primarily label based, but may also be used with a boolean array.

Allowed inputs are:

- A single label, e.g. 5 or 'a', (note that 5 is interpreted as a label of the index, and never as an integer position along the index).
- A list or array of labels, e.g. ['a', 'b', 'c'].
- A slice object with labels, e.g. 'a':'f'.
- A boolean array of the same length as the axis being sliced, e.g. [True, False, True].

In [76]:
df1 = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
     index=['cobra', 'viper', 'sidewinder'],
     columns=['max_speed', 'shield'])
df1

Unnamed: 0,max_speed,shield
cobra,1,2
viper,4,5
sidewinder,7,8


In [77]:
df1.loc['viper']

max_speed    4
shield       5
Name: viper, dtype: int64

In [78]:
df1.loc[['viper', 'sidewinder']]

Unnamed: 0,max_speed,shield
viper,4,5
sidewinder,7,8
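Two further inputs from the list above, a label slice and a boolean selection, can be sketched on the same small frame. The name `snakes` is only used here for illustration. Unlike `iloc[]`, a label slice in `loc[]` includes its end label.

```python
import pandas as pd

snakes = pd.DataFrame(
    [[1, 2], [4, 5], [7, 8]],
    index=["cobra", "viper", "sidewinder"],
    columns=["max_speed", "shield"],
)

# Label slice: both 'cobra' and 'viper' are included
label_slice = snakes.loc["cobra":"viper"]

# Boolean selection on rows combined with a column label
fast = snakes.loc[snakes["max_speed"] > 3, "shield"]

# Row label and column label together give a single value
cell = snakes.loc["viper", "shield"]

print(list(label_slice.index))
print(fast.tolist())
print(cell)
```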


## DataFrame from a List of Dictionaries

In [80]:
mydict = [{'a': 1, 'b': 2, 'c': 3, 'd': 4},
          {'a': 100, 'b': 200, 'c': 300, 'd': 400},
          {'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000 }]
df1 = pd.DataFrame(mydict)
df1

Unnamed: 0,a,b,c,d
0,1,2,3,4
1,100,200,300,400
2,1000,2000,3000,4000
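A DataFrame can be built from dictionaries in two orientations: a list of dictionaries (one per row, as above) or a single dictionary of lists (one per column). The sketch below uses invented values; it also shows that a key missing from one row dictionary becomes NaN.

```python
import pandas as pd

rows = [{"a": 1, "b": 2}, {"a": 100, "b": 200}]   # one dict per row
columns = {"a": [1, 100], "b": [2, 200]}          # one list per column

from_rows = pd.DataFrame(rows)
from_columns = pd.DataFrame(columns)
print(from_rows.equals(from_columns))   # both describe the same table

# The first row has no 'b' key, so that cell is filled with NaN
sparse = pd.DataFrame([{"a": 1}, {"a": 2, "b": 3}])
print(sparse)
```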


## Manipulating data
### Deletion of data
### drop()
`drop()` removes rows or columns, either by giving label names together with the corresponding axis, or by passing the `index` or `columns` keyword directly. When using a multi-index, labels on different levels can be removed by specifying the level.

It returns a new DataFrame without the removed index or column labels, or None if inplace=True.

In [81]:
df.head()

Unnamed: 0,sl,sw,pl,pw,flower_type
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [82]:
a = df.drop(0)
a.head()

Unnamed: 0,sl,sw,pl,pw,flower_type
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
5,5.4,3.9,1.7,0.4,Iris-setosa


In [83]:
df.head()

Unnamed: 0,sl,sw,pl,pw,flower_type
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [84]:
df.drop(0, inplace = True)
df.head()

Unnamed: 0,sl,sw,pl,pw,flower_type
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
5,5.4,3.9,1.7,0.4,Iris-setosa


If we try to drop label 0 once again:

In [85]:
df.drop(0, inplace = True)   #Error Generated
df.head()

KeyError: '[0] not found in axis'

The reason is that the row labels are not renumbered after a drop. The row with label 0 is gone, so the labels now begin from 1 and there is no label 0 left to remove.

As we learnt in the definition, `drop()` removes rows by their labels. To remove a row by its position instead, we can look up the label at that position with `df.index`:

In [86]:
df.drop(df.index[0], inplace = True)
df.head()

Unnamed: 0,sl,sw,pl,pw,flower_type
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
5,5.4,3.9,1.7,0.4,Iris-setosa
6,4.6,3.4,1.4,0.3,Iris-setosa


In [87]:
df.drop(df.index[3], inplace = True)   ## Label 5 removed
df.head()

Unnamed: 0,sl,sw,pl,pw,flower_type
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
6,4.6,3.4,1.4,0.3,Iris-setosa
7,5.0,3.4,1.5,0.2,Iris-setosa


We may also remove many labels in one go.

In [88]:
df.drop(df.index[[3, 4]], inplace = True)   ## Label 6, 7 removed
df.head()

Unnamed: 0,sl,sw,pl,pw,flower_type
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
8,4.4,2.9,1.4,0.2,Iris-setosa
9,4.9,3.1,1.5,0.1,Iris-setosa


In [89]:
df.drop('sl')   ## Error Generated

KeyError: "['sl'] not found in axis"

An error is generated because the drop function is currently looking for a row with label 'sl'. To drop a column, we need to change the axis by passing `axis=1`.

In [91]:
df.drop('sl',axis=1)

Unnamed: 0,sw,pl,pw,flower_type
2,3.2,1.3,0.2,Iris-setosa
3,3.1,1.5,0.2,Iris-setosa
4,3.6,1.4,0.2,Iris-setosa
8,2.9,1.4,0.2,Iris-setosa
9,3.1,1.5,0.1,Iris-setosa
...,...,...,...,...
145,3.0,5.2,2.3,Iris-virginica
146,2.5,5.0,1.9,Iris-virginica
147,3.0,5.2,2.0,Iris-virginica
148,3.4,5.4,2.3,Iris-virginica
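Instead of a label plus `axis`, `drop()` also accepts the `index=` and `columns=` keywords, which some readers find easier to read. This is a small sketch on invented data; `errors="ignore"` is an optional setting that skips labels that do not exist instead of raising a KeyError.

```python
import pandas as pd

frame = pd.DataFrame({
    "sl": [5.1, 4.9, 4.7],
    "sw": [3.5, 3.0, 3.2],
    "pl": [1.4, 1.4, 1.3],
})

no_first_row = frame.drop(index=0)       # same as frame.drop(0)
no_sl = frame.drop(columns="sl")         # same as frame.drop('sl', axis=1)

# Label 99 does not exist; with errors="ignore" it is silently skipped
tolerant = frame.drop(index=[0, 99], errors="ignore")

print(list(no_first_row.index))
print(list(no_sl.columns))
print(list(tolerant.index))
print(frame.shape)                       # the original frame is unchanged
```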


# Conditional Insights
We may use boolean indexing on a DataFrame to select a particular subset of the data and draw inferences from it.

Let's try to gain insights into the data corresponding to Iris-virginica.

In [92]:
df[df.flower_type == 'Iris-virginica'].describe()

Unnamed: 0,sl,sw,pl,pw
count,50.0,50.0,50.0,50.0
mean,6.588,2.974,5.552,2.026
std,0.63588,0.322497,0.551895,0.27465
min,4.9,2.2,4.5,1.4
25%,6.225,2.8,5.1,1.8
50%,6.5,3.0,5.55,2.0
75%,6.9,3.175,5.875,2.3
max,7.9,3.8,6.9,2.5
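Boolean conditions can also be combined. The sketch below, on invented rows, uses `&` (and), `|` (or) and `~` (not); each comparison is wrapped in parentheses or stored in a variable first, because these operators bind more tightly than `==` and `>`.

```python
import pandas as pd

frame = pd.DataFrame({
    "pl": [1.4, 4.5, 5.1, 6.0],
    "flower_type": ["Iris-setosa", "Iris-versicolor",
                    "Iris-virginica", "Iris-virginica"],
})

is_virginica = frame["flower_type"] == "Iris-virginica"
long_petal = frame["pl"] > 5.5

both = frame[is_virginica & long_petal]             # both conditions hold
either = frame[is_virginica | (frame["pl"] < 2)]    # at least one holds
not_virginica = frame[~is_virginica]                # condition negated

print(list(both.index))
print(list(either.index))
print(list(not_virginica.index))
```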


## Addition of data
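### loc[]
Assigning to `df.loc[label]` adds a row with that label if the label does not exist yet. Label 0 was dropped earlier, so the new row is added at the end.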

In [93]:
df.loc[0] = [1, 2, 3, 4, 'Iris-virginica']
df.tail()

Unnamed: 0,sl,sw,pl,pw,flower_type
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica
149,5.9,3.0,5.1,1.8,Iris-virginica
0,1.0,2.0,3.0,4.0,Iris-virginica
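Whether `loc[]` assignment adds or overwrites a row depends only on whether the label already exists. This is a minimal sketch with invented values, assuming the default 0, 1, 2, ... index so that `len(frame)` is an unused label.

```python
import pandas as pd

frame = pd.DataFrame({
    "sl": [5.1, 4.9],
    "flower_type": ["Iris-setosa", "Iris-setosa"],
})

# Label 2 does not exist yet, so a new row is appended
frame.loc[len(frame)] = [6.3, "Iris-virginica"]

# Label 0 already exists, so that row is overwritten
frame.loc[0] = [1.0, "Iris-virginica"]

print(frame)
```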


We may also create new columns directly, for example by computing them from existing columns.

In [95]:
df["diff_of_sl_sw"] = df['sl'] - df['sw']
df.head()

Unnamed: 0,sl,sw,pl,pw,flower_type,diff_of_sl_sw
2,4.7,3.2,1.3,0.2,Iris-setosa,1.5
3,4.6,3.1,1.5,0.2,Iris-setosa,1.5
4,5.0,3.6,1.4,0.2,Iris-setosa,1.4
8,4.4,2.9,1.4,0.2,Iris-setosa,1.5
9,4.9,3.1,1.5,0.1,Iris-setosa,1.8


In [96]:
df.drop('diff_of_sl_sw', axis = 1, inplace = True)

## Reset Index
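After removing rows, the index labels are no longer consecutive. We can renumber them with the reset_index() function. It returns a new DataFrame and leaves df unchanged unless we assign the result back or pass inplace=True.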

In [97]:
df.reset_index()

Unnamed: 0,index,sl,sw,pl,pw,flower_type
0,2,4.7,3.2,1.3,0.2,Iris-setosa
1,3,4.6,3.1,1.5,0.2,Iris-setosa
2,4,5.0,3.6,1.4,0.2,Iris-setosa
3,8,4.4,2.9,1.4,0.2,Iris-setosa
4,9,4.9,3.1,1.5,0.1,Iris-setosa
...,...,...,...,...,...,...
141,146,6.3,2.5,5.0,1.9,Iris-virginica
142,147,6.5,3.0,5.2,2.0,Iris-virginica
143,148,6.2,3.4,5.4,2.3,Iris-virginica
144,149,5.9,3.0,5.1,1.8,Iris-virginica


But this has created an additional column with old indices. To avoid that, we do:

In [98]:
df.reset_index(drop = True)

Unnamed: 0,sl,sw,pl,pw,flower_type
0,4.7,3.2,1.3,0.2,Iris-setosa
1,4.6,3.1,1.5,0.2,Iris-setosa
2,5.0,3.6,1.4,0.2,Iris-setosa
3,4.4,2.9,1.4,0.2,Iris-setosa
4,4.9,3.1,1.5,0.1,Iris-setosa
...,...,...,...,...,...
141,6.3,2.5,5.0,1.9,Iris-virginica
142,6.5,3.0,5.2,2.0,Iris-virginica
143,6.2,3.4,5.4,2.3,Iris-virginica
144,5.9,3.0,5.1,1.8,Iris-virginica
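Because `reset_index()` returns a new DataFrame, the result has to be stored to take effect. A short sketch on an invented frame with gaps in its labels:

```python
import pandas as pd

frame = pd.DataFrame({"sl": [4.7, 4.6, 5.0]}, index=[2, 3, 8])

renumbered = frame.reset_index(drop=True)
print(list(frame.index))        # the original labels are still there
print(list(renumbered.index))   # the returned frame is renumbered

# To keep the renumbering, assign the result back
frame = frame.reset_index(drop=True)
print(list(frame.index))
```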


## Handling NaN
### Values considered “missing”

- As data comes in many shapes and forms, pandas aims to be flexible with regard to handling missing data. NaN is the default missing value marker for reasons of computational speed and convenience, so we need to be able to easily detect this value in data of different types: floating point, integer, boolean, and general object.

- In many cases, however, the Python None will arise and we wish to also consider that “missing” or “not available” or “NA”.

- To make detecting missing values easier (and across different array dtypes), pandas provides the `isna()` and `notna()` functions, which are also methods on Series and DataFrame objects.

- Because NaN is a float, a column of integers with even one missing value is cast to floating-point dtype.
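The points above can be observed on a few small Series. This is a sketch with invented values; `dtype=object` is passed explicitly for the string example so that the result does not depend on how a given pandas version infers string columns.

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 3])
with_gap = pd.Series([1, None, 3])     # None is stored as NaN
labels = pd.Series(["a", None, np.nan], dtype=object)

print(ints.dtype)        # integer dtype
print(with_gap.dtype)    # cast to float because of the missing value

# isna() treats both None and NaN as missing
print(with_gap.isna().tolist())
print(labels.isna().tolist())

# notna() is the opposite; summing it counts the present values
print(with_gap.notna().sum())
```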

NaN values can create inaccuracies in our estimations and calculations. There are two ways we can handle NaN:

- we either remove them,
- or we fill them.

Our current data does not have any NaN values, so we will create some.

In [99]:
import numpy as np
df = iris.copy()
df.columns = ['sl', 'sw', 'pl', 'pw', 'flower_type']

In [100]:
df.iloc[2:4, 1:3] = np.nan
df.head()

Unnamed: 0,sl,sw,pl,pw,flower_type
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,,,0.2,Iris-setosa
3,4.6,,,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [101]:
df.describe()

Unnamed: 0,sl,sw,pl,pw
count,150.0,148.0,148.0,150.0
mean,5.843333,3.052703,3.790541,1.198667
std,0.828066,0.436349,1.754618,0.763161
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.4,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


### Dropping NaN
`dropna()` removes the rows (by default) or the columns that contain NaN values.

In [102]:
df.dropna(inplace = True)  # Drop the rows containing NaN, modifying df itself
df.reset_index(drop = True, inplace = True)   # Renumber the rows from 0

In [103]:
df.head()

Unnamed: 0,sl,sw,pl,pw,flower_type
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,5.0,3.6,1.4,0.2,Iris-setosa
3,5.4,3.9,1.7,0.4,Iris-setosa
4,4.6,3.4,1.4,0.3,Iris-setosa


As you may observe, the rows containing NaN (labels 2 and 3) have been removed. To remove the columns that contain NaN instead, pass `axis=1` to `dropna()`.
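The most common `dropna()` options can be compared side by side. This is a minimal sketch on an invented frame; `how` and `subset` are existing `dropna()` parameters that the notebook itself does not use.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "sl": [5.1, 4.9, 4.7],
    "sw": [3.5, np.nan, 3.2],
    "pl": [np.nan, np.nan, np.nan],
})

rows_any = frame.dropna()                    # drop rows with any NaN
cols_any = frame.dropna(axis=1)              # drop columns with any NaN
cols_all = frame.dropna(axis=1, how="all")   # drop columns that are entirely NaN
rows_subset = frame.dropna(subset=["sw"])    # only look at 'sw' when dropping rows

print(len(rows_any))             # every row has a NaN in 'pl', so none survive
print(list(cols_any.columns))
print(list(cols_all.columns))
print(list(rows_subset.index))
```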

### Filling NaN
`fillna()` replaces NaN with a value we supply. This can be a single value, or a dict or Series that is alignable: the keys of the dict or the index of the Series must match the columns of the frame you wish to fill.

Generally we fill the NaN values with the mean, but depending on the type of data and your own analysis, you may decide to fill NaN in some other way.

In [104]:
df.iloc[2:4, 1:3] = np.nan
df.head()

Unnamed: 0,sl,sw,pl,pw,flower_type
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,5.0,,,0.2,Iris-setosa
3,5.4,,,0.4,Iris-setosa
4,4.6,3.4,1.4,0.3,Iris-setosa


In [105]:
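# Assign the filled column back. Calling fillna(inplace=True) on df.sw acts on
# an intermediate object; newer pandas versions warn about it or leave df unchanged.
df['sw'] = df['sw'].fillna(df['sw'].mean())
df['pl'] = df['pl'].fillna(df['pl'].mean())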
df.head()

Unnamed: 0,sl,sw,pl,pw,flower_type
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,5.0,3.043151,3.821233,0.2,Iris-setosa
3,5.4,3.043151,3.821233,0.4,Iris-setosa
4,4.6,3.4,1.4,0.3,Iris-setosa


Note: Since all the NaN values belonged to 'Iris-setosa', a better fill value would have been the mean of 'sw' (and likewise 'pl') over only those rows where the flower type is Iris-setosa.

In [106]:
df.iloc[2:4, 1:3] = np.nan
df.head()

Unnamed: 0,sl,sw,pl,pw,flower_type
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,5.0,,,0.2,Iris-setosa
3,5.4,,,0.4,Iris-setosa
4,4.6,3.4,1.4,0.3,Iris-setosa


In [107]:
df_setosa = df[df.flower_type == 'Iris-setosa']
# Fill with the Iris-setosa means and assign the result back to df
df['sw'] = df['sw'].fillna(df_setosa.sw.mean())
df['pl'] = df['pl'].fillna(df_setosa.pl.mean())
df.head()

Unnamed: 0,sl,sw,pl,pw,flower_type
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,5.0,3.415217,1.463043,0.2,Iris-setosa
3,5.4,3.415217,1.463043,0.4,Iris-setosa
4,4.6,3.4,1.4,0.3,Iris-setosa
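When NaN values occur in more than one flower type, filling each one with the mean of its own group can be done in one step. The sketch below is one possible approach, not the notebook's method: it uses `groupby(...).transform("mean")` on invented values to build a Series of per-group means aligned with the original rows.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "sw": [3.5, np.nan, 3.1, 2.8, np.nan, 3.0],
    "flower_type": ["Iris-setosa"] * 3 + ["Iris-virginica"] * 3,
})

# For every row, the mean 'sw' of that row's flower type (NaN is ignored)
group_mean = frame.groupby("flower_type")["sw"].transform("mean")

# fillna() with a Series fills each NaN from the matching index position
frame["sw"] = frame["sw"].fillna(group_mean)

print(frame)
```

The setosa gap is filled with the setosa mean (3.3) and the virginica gap with the virginica mean (2.9).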


## Duplicate Labels
Index objects are not required to be unique; you can have duplicate row or column labels.

But one of pandas’ roles is to clean messy, real-world data before it goes to some downstream system. And real-world data has duplicates, even in fields that are supposed to be unique.

Let's see how duplicate labels change the behavior of certain operations, and how to prevent duplicates from arising during operations or to detect them if they do.

## Consequences of Duplicate Labels
Some pandas methods (Series.reindex() for example) just don’t work with duplicates present. The output can’t be determined, and so pandas raises an error.

Other methods, like indexing, can give very surprising results. Typically indexing with a scalar will reduce dimensionality: indexing a DataFrame with a scalar returns a Series, and indexing a Series with a scalar returns a scalar. But with duplicates, this isn’t the case.

In [109]:
df1 = pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=["A", "A", "B"])
df1

Unnamed: 0,A,A,B
0,0,1,2
1,3,4,5


We have duplicates in the columns. If we select 'B', we get back a Series:

In [110]:
print(df1["B"])  # a series
type(df1["B"])

0    2
1    5
Name: B, dtype: int64


pandas.core.series.Series

But selecting 'A' returns a DataFrame:

In [111]:
print(df1["A"]) # a DataFrame
type(df1["A"])

   A  A
0  0  1
1  3  4


pandas.core.frame.DataFrame

This applies to row labels as well.

In [112]:
df2 = pd.DataFrame({"A": [0, 1, 2]}, index=["a", "a", "b"])
df2

Unnamed: 0,A
a,0
a,1
b,2


In [113]:
df2.loc["b", "A"]  # a scalar

2

In [114]:
df2.loc["a", "A"]  # a Series

a    0
a    1
Name: A, dtype: int64

## Duplicate Label Detection
You can check whether an Index (storing the row or column labels) is unique with Index.is_unique:

In [115]:
df2

Unnamed: 0,A
a,0
a,1
b,2


In [116]:
df2.index.is_unique

False

In [117]:
df2.columns.is_unique

True

Index.duplicated() will return a boolean ndarray indicating whether a label is repeated.

In [118]:
df2.index.duplicated()

array([False,  True, False])
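Once duplicates are detected, the boolean array from `Index.duplicated()` can be used to deal with them. The sketch below shows two possible choices on the same small example: keeping the first row per label, or averaging the rows that share a label. The name `df_dup` is made up for illustration.

```python
import pandas as pd

df_dup = pd.DataFrame({"A": [0, 1, 2]}, index=["a", "a", "b"])

# Keep only the first row for every label
first_only = df_dup[~df_dup.index.duplicated(keep="first")]

# Or combine rows that share a label, here by taking their mean
averaged = df_dup.groupby(level=0).mean()

print(first_only)
print(averaged)
```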

## Handling Strings in Data
Our algorithms make calculations over numerical data. String data is very hard to work with quantitatively.

But it won't make sense to simply ignore string data. For example, if a dataset is used to evaluate shopping habits and has a gender column with the categories 'male' and 'female', we cannot just drop it, as the habits of the two groups may differ considerably.

So, to handle such cases, we convert the string data to numerical data.

In [119]:
df

Unnamed: 0,sl,sw,pl,pw,flower_type
0,5.1,3.500000,1.400000,0.2,Iris-setosa
1,4.9,3.000000,1.400000,0.2,Iris-setosa
2,5.0,3.415217,1.463043,0.2,Iris-setosa
3,5.4,3.415217,1.463043,0.4,Iris-setosa
4,4.6,3.400000,1.400000,0.3,Iris-setosa
...,...,...,...,...,...
143,6.7,3.000000,5.200000,2.3,Iris-virginica
144,6.3,2.500000,5.000000,1.9,Iris-virginica
145,6.5,3.000000,5.200000,2.0,Iris-virginica
146,6.2,3.400000,5.400000,2.3,Iris-virginica


In [120]:
df['Gender'] = 'Female'
df.iloc[0:10, 5] = 'Male'
df

Unnamed: 0,sl,sw,pl,pw,flower_type,Gender
0,5.1,3.500000,1.400000,0.2,Iris-setosa,Male
1,4.9,3.000000,1.400000,0.2,Iris-setosa,Male
2,5.0,3.415217,1.463043,0.2,Iris-setosa,Male
3,5.4,3.415217,1.463043,0.4,Iris-setosa,Male
4,4.6,3.400000,1.400000,0.3,Iris-setosa,Male
...,...,...,...,...,...,...
143,6.7,3.000000,5.200000,2.3,Iris-virginica,Female
144,6.3,2.500000,5.000000,1.9,Iris-virginica,Female
145,6.5,3.000000,5.200000,2.0,Iris-virginica,Female
146,6.2,3.400000,5.400000,2.3,Iris-virginica,Female


In [121]:
def func(s):
  if s == 'Male':
    return 0
  else:
    return 1


df['Sex'] = df.Gender.apply(func)
del df['Gender']
df

Unnamed: 0,sl,sw,pl,pw,flower_type,Sex
0,5.1,3.500000,1.400000,0.2,Iris-setosa,0
1,4.9,3.000000,1.400000,0.2,Iris-setosa,0
2,5.0,3.415217,1.463043,0.2,Iris-setosa,0
3,5.4,3.415217,1.463043,0.4,Iris-setosa,0
4,4.6,3.400000,1.400000,0.3,Iris-setosa,0
...,...,...,...,...,...,...
143,6.7,3.000000,5.200000,2.3,Iris-virginica,1
144,6.3,2.500000,5.000000,1.9,Iris-virginica,1
145,6.5,3.000000,5.200000,2.0,Iris-virginica,1
146,6.2,3.400000,5.400000,2.3,Iris-virginica,1


Now, we may apply algorithms which take into consideration the 'Sex' column too.
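The same conversion can be written without a helper function. The sketch below shows three alternatives on an invented `people` frame: `map()` with an explicit dictionary, category codes, and one indicator column per category via `pd.get_dummies()`. Note that category codes are assigned in sorted order of the category names, so they need not match a hand-written mapping.

```python
import pandas as pd

people = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

# Explicit mapping, equivalent to the apply(func) approach
people["Sex"] = people["Gender"].map({"Male": 0, "Female": 1})

# Category codes: 'Female' sorts before 'Male', so Female=0 and Male=1
people["Gender_code"] = people["Gender"].astype("category").cat.codes

# One indicator column per category
dummies = pd.get_dummies(people["Gender"], prefix="Gender")

print(people)
print(dummies)
```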

# Count of Flower

In [123]:
import pandas as pd

column_name = ['SepalLength','SepalWidth','PetalLength','PetalWidth','Species']

iris = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
                   names=column_name)

for i in iris['Species']:
    print(i,end = " ")

Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor 

In [124]:
import pandas as pd

column_name = ['SepalLength','SepalWidth','PetalLength','PetalWidth','Species']

iris = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
                   names=column_name)

for i in iris['Species'].value_counts():
    print(i,end = " ")

50 50 50 
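Iterating over `value_counts()` directly yields only the counts. To see which count belongs to which species, the labels can be read alongside with `items()`. A small sketch on an invented Series:

```python
import pandas as pd

species = pd.Series(
    ["Iris-setosa"] * 3 + ["Iris-versicolor"] * 2 + ["Iris-virginica"]
)

counts = species.value_counts()

# items() yields (label, count) pairs, most frequent label first
for name, count in counts.items():
    print(name, count)
```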

# Iris Virginica

In [132]:
import pandas as pd

columns = ['sl','sw','pl','pw','flower_type']

iris = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', 
                   names=columns)

iris = iris[iris.flower_type=='Iris-virginica']

iris = iris[iris.pl>1.5]

# print(iris)
iris = iris.values

for row in iris :
    print(*row)

6.3 3.3 6.0 2.5 Iris-virginica
5.8 2.7 5.1 1.9 Iris-virginica
7.1 3.0 5.9 2.1 Iris-virginica
6.3 2.9 5.6 1.8 Iris-virginica
6.5 3.0 5.8 2.2 Iris-virginica
7.6 3.0 6.6 2.1 Iris-virginica
4.9 2.5 4.5 1.7 Iris-virginica
7.3 2.9 6.3 1.8 Iris-virginica
6.7 2.5 5.8 1.8 Iris-virginica
7.2 3.6 6.1 2.5 Iris-virginica
6.5 3.2 5.1 2.0 Iris-virginica
6.4 2.7 5.3 1.9 Iris-virginica
6.8 3.0 5.5 2.1 Iris-virginica
5.7 2.5 5.0 2.0 Iris-virginica
5.8 2.8 5.1 2.4 Iris-virginica
6.4 3.2 5.3 2.3 Iris-virginica
6.5 3.0 5.5 1.8 Iris-virginica
7.7 3.8 6.7 2.2 Iris-virginica
7.7 2.6 6.9 2.3 Iris-virginica
6.0 2.2 5.0 1.5 Iris-virginica
6.9 3.2 5.7 2.3 Iris-virginica
5.6 2.8 4.9 2.0 Iris-virginica
7.7 2.8 6.7 2.0 Iris-virginica
6.3 2.7 4.9 1.8 Iris-virginica
6.7 3.3 5.7 2.1 Iris-virginica
7.2 3.2 6.0 1.8 Iris-virginica
6.2 2.8 4.8 1.8 Iris-virginica
6.1 3.0 4.9 1.8 Iris-virginica
6.4 2.8 5.6 2.1 Iris-virginica
7.2 3.0 5.8 1.6 Iris-virginica
7.4 2.8 6.1 1.9 Iris-virginica
7.9 3.8 6.4 2.0 Iris-virginica
6.4 2.8 

In this code, we use the Pandas library to read the Iris dataset from the given URL and create a DataFrame called iris. We then filter the DataFrame to include only rows where the flower_type is 'Iris-virginica' and the 'pl' (petal length) is greater than 1.5. Finally, we convert the filtered DataFrame to a NumPy array using the values attribute, and then we iterate through the array and print each row.
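The two filters can also be combined into a single boolean mask, and the rows can be printed without converting to an array first. This is a sketch on invented rows; `itertuples()` and `to_numpy()` are existing pandas methods, shown here as alternatives to iterating over `.values`.

```python
import pandas as pd

frame = pd.DataFrame({
    "sl": [5.1, 6.3, 5.8],
    "pl": [1.4, 6.0, 5.1],
    "flower_type": ["Iris-setosa", "Iris-virginica", "Iris-virginica"],
})

# Both conditions in one mask
mask = (frame["flower_type"] == "Iris-virginica") & (frame["pl"] > 5.5)
selected = frame[mask]

# Print each selected row, values separated by spaces
for row in selected.itertuples(index=False):
    print(*row)

# to_numpy() gives the same array that .values would
array = selected.to_numpy()
print(array.shape)
```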

# Iris Values

The code below analyzes the Iris dataset for each of the three flower types ('Iris-setosa', 'Iris-versicolor' and 'Iris-virginica'). It calculates and prints the minimum, maximum, and mean values of the four features ('sl', 'sw', 'pl', 'pw') for each flower type. The last item on each line, `iloc[0, 4]`, is the flower type name taken from the first row of that group.

In [134]:
import pandas as pd
columns = ['sl','sw','pl','pw','flower_type']
iris = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', names=columns)

c = iris[iris.flower_type=='Iris-setosa']

print("Min row, max row and Mean Row ")
print('%.2f'%min(c['sl']),
      '%.2f'%min(c['sw']),
      '%.2f'%min(c['pl']),
      '%.2f'%min(c['pw']),
      c.iloc[0,4])
print('%.2f'%max(c['sl']),
      '%.2f'%max(c['sw']),
      '%.2f'%max(c['pl']),
      '%.2f'%max(c['pw']),
      c.iloc[0,4])
print('%.2f'%(c['sl'].mean()),
      '%.2f'%(c['sw'].mean()),
      '%.2f'%(c['pl'].mean()),
      '%.2f'%(c['pw'].mean()),
      c.iloc[0,4])

c1 = iris[iris.flower_type=='Iris-versicolor']

print('%.2f'%min(c1['sl']),
      '%.2f'%min(c1['sw']),
      '%.2f'%min(c1['pl']),
      '%.2f'%min(c1['pw']),
      c1.iloc[0,4])
print('%.2f'%max(c1['sl']),
      '%.2f'%max(c1['sw']),
      '%.2f'%max(c1['pl']),
      '%.2f'%max(c1['pw']),
      c1.iloc[0,4])
print('%.2f'%(c1['sl'].mean()),
      '%.2f'%(c1['sw'].mean()),
      '%.2f'%(c1['pl'].mean()),
      '%.2f'%(c1['pw'].mean()),
      c1.iloc[0,4])

c2 = iris[iris.flower_type=='Iris-virginica']
print('%.2f'%min(c2['sl']),
      '%.2f'%min(c2['sw']),
      '%.2f'%min(c2['pl']),
      '%.2f'%min(c2['pw']),
      c2.iloc[0,4])
print('%.2f'%max(c2['sl']),
      '%.2f'%max(c2['sw']),
      '%.2f'%max(c2['pl']),
      '%.2f'%max(c2['pw']),
      c2.iloc[0,4])
print('%.2f'%(c2['sl'].mean()),
      '%.2f'%(c2['sw'].mean()),
      '%.2f'%(c2['pl'].mean()),
      '%.2f'%(c2['pw'].mean()),
      c2.iloc[0,4])

Min row, max row and Mean Row 
4.30 2.30 1.00 0.10 Iris-setosa
5.80 4.40 1.90 0.60 Iris-setosa
5.01 3.42 1.46 0.24 Iris-setosa
4.90 2.00 3.00 1.00 Iris-versicolor
7.00 3.40 5.10 1.80 Iris-versicolor
5.94 2.77 4.26 1.33 Iris-versicolor
4.90 2.20 4.50 1.40 Iris-virginica
7.90 3.80 6.90 2.50 Iris-virginica
6.59 2.97 5.55 2.03 Iris-virginica
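The same per-type statistics can be computed without repeating the code for each flower type. The sketch below is an alternative to the cell above, not a reproduction of it: it uses `groupby()` with `agg()` on a small invented frame with only two features.

```python
import pandas as pd

frame = pd.DataFrame({
    "sl": [5.0, 5.2, 6.0, 6.4],
    "sw": [3.4, 3.6, 2.8, 3.0],
    "flower_type": ["Iris-setosa", "Iris-setosa",
                    "Iris-virginica", "Iris-virginica"],
})

# One row per flower type, one (feature, statistic) column per combination
summary = (
    frame.groupby("flower_type")[["sl", "sw"]]
    .agg(["min", "max", "mean"])
    .round(2)
)

print(summary)
```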


`'%.2f'%min(c['sl'])` is a string formatting operation in Python. Let's break it down step by step:

1. `min(c['sl'])`: This part calculates the minimum value of the 'sl' column (sepal length) for the subset of the Iris dataset represented by DataFrame `c`. It finds the smallest value in the 'sl' column that satisfies the condition for the 'Iris-setosa' flower type.

2. `%.2f`: This is a string formatting placeholder. The `%` character marks the position where a value will be inserted into the string, and `.2f` specifies its format: a floating-point number with two decimal places.

3. `'%.2f'%min(c['sl'])`: This part combines the calculated minimum value from step 1 with the placeholder from step 2. The `%` operator performs the string formatting: it replaces the `%.2f` in the string with the value of `min(c['sl'])` formatted as a floating-point number with two decimal places.

In summary, `'%.2f'%min(c['sl'])` is used to format the minimum value of the 'sl' column (sepal length) from DataFrame `c` as a string with two decimal places. The output will be a string representation of the minimum sepal length value with two decimal places. For example, if the minimum sepal length is 4.3, the output would be the string `'4.30'`.
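Python offers two newer ways to write the same formatting, `str.format()` and f-strings. The sketch below uses a plain number (the variable `value` is made up for illustration) to show that all three produce the same string.

```python
value = 4.3

old_style = '%.2f' % value             # the % operator used in the notebook
with_format = '{:.2f}'.format(value)   # str.format()
with_fstring = f'{value:.2f}'          # f-string (Python 3.6+)

print(old_style, with_format, with_fstring)   # 4.30 4.30 4.30
```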

# Terror Attack City

In [139]:
import pandas as pd

path = 'terrorismData.csv'
df_terrorism = pd.read_csv(path, encoding='ISO-8859-1')

# Filter the DataFrame for 'Jammu and Kashmir' state
df_terrorism = df_terrorism[df_terrorism['State'] == 'Jammu and Kashmir']

city_list = df_terrorism['City'].value_counts()

city = city_list.index[0]
attack = city_list.iloc[0]

print("city",city)
print("Attack",attack)

# Filter the DataFrame for the most frequent city
df_terrorism = df_terrorism[df_terrorism['City'] == city]

# Get the second most frequent terrorist group in that city
group = df_terrorism['Group'].value_counts().index[1]

print(city, attack, group)


city Srinagar
Attack 657
Srinagar 657 Muslim Separatists


In [141]:
import pandas as pd

# Sample DataFrame
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)

# Using iloc to access data by integer position
value_at_iloc = df.iloc[0, 1]  # This will be 4 (first row, second column)

# Using index to access the row labels
row_labels = df.index  # This will return the index labels of the DataFrame

# value_counts() sorts by count, so index[0] is a most frequent value.
# Here every value in column 'A' occurs once, so all three values are tied.
most_frequent_value = df['A'].value_counts().index[0]
print(most_frequent_value)

1


The expression `df_terrorism['City'].value_counts()` calculates the frequency count of each unique value in the 'City' column of the DataFrame `df_terrorism`. It returns a pandas Series containing the counts for each unique city name.

Let's break it down further:

- `df_terrorism['City']`: This extracts the 'City' column from the DataFrame `df_terrorism`. It gives a Series that contains the city names for each row in the DataFrame.

- `value_counts()`: This is a pandas Series method that calculates the frequency count of each unique value in the Series. It counts how many times each unique city name appears in the 'City' column and arranges the results in descending order of count. The result is a Series where the city names are the index, and the counts are the corresponding values.

For example, suppose the 'City' column contains the following data:

```plaintext
0    Srinagar
1    Srinagar
2    Jammu
3    Jammu
4    Jammu
5    Anantnag
6    Srinagar
```

Then, `df_terrorism['City'].value_counts()` will return the following pandas Series:

```plaintext
Srinagar    3
Jammu       3
Anantnag    1
dtype: int64
```

In this example, 'Srinagar' and 'Jammu' appear three times each, while 'Anantnag' appears only once in the 'City' column of the DataFrame `df_terrorism`.
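The example above can be run directly. The sketch below rebuilds the seven city names as a Series and also shows one way to list all cities that share the highest count, since `index[0]` returns only one of them when there is a tie.

```python
import pandas as pd

cities = pd.Series([
    "Srinagar", "Srinagar", "Jammu", "Jammu", "Jammu", "Anantnag", "Srinagar",
])

counts = cities.value_counts()

top_city = counts.index[0]     # one of the most frequent cities
top_count = counts.iloc[0]     # its number of occurrences

# All cities that reach the highest count
tied = counts[counts == counts.max()].index.tolist()

print(top_city, top_count)
print(sorted(tied))
```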

# Terror Government

In [144]:
import pandas as pd
import numpy as np

# Read the CSV file into a DataFrame
df = pd.read_csv('terrorismData.csv')

# Filter the DataFrame to get rows where 'Day' is greater than or equal to 26
a = df[df.Day >= 26]

# Further filter 'a' to get rows where 'Year' is equal to 2014
b = a[a.Year == 2014]

# Further filter 'b' to get rows where 'Country' is 'India'
c = b[b.Country == 'India']

# Store the filtered DataFrame in 'ans1' where 'Month' is equal to 5
ans1 = c[c.Month == 5]

# Filter the DataFrame to get rows where 'Year' is equal to 2014
d = df[df.Year == 2014]

# Further filter 'd' to get rows where 'Country' is 'India'
e = d[d.Country == 'India']

# Store the filtered DataFrame in 'ans2' where 'Month' is greater than 5
ans2 = e[e.Month > 5]

# Filter the DataFrame to get rows where 'Country' is 'India'
f = df[df.Country == 'India']

# Store the filtered DataFrame in 'ans3' where 'Year' is greater than 2014
ans3 = f[f.Year > 2014]

# Calculate the count of rows in the filtered DataFrames
count = ans1.shape[0] + ans2.shape[0] + ans3.shape[0]

# Remove rows from each DataFrame where 'Group' is 'Unknown'
ans1 = ans1[ans1.Group != 'Unknown']
ans2 = ans2[ans2.Group != 'Unknown']
ans3 = ans3[ans3.Group != 'Unknown']

# Print the count and the most frequent value in the 'Group' column of 'ans3' DataFrame
print(count, ans3.Group.describe().top)

3336 Maoists


`ans1.shape[0]` is an expression that returns the number of rows in the DataFrame 'ans1'. The `shape` attribute of a DataFrame gives the dimensions of the DataFrame as a tuple (number of rows, number of columns). By accessing the element at index 0 of the shape tuple, we get the number of rows.

In the provided code, 'ans1' is a DataFrame that was created based on specific filtering conditions, and `ans1.shape[0]` gives the count of rows that meet those conditions.

For example, if 'ans1' has 50 rows after applying the filtering conditions, then `ans1.shape[0]` will return the value 50.
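The three filtered frames can also be expressed as one boolean mask, so that a single row count gives the total. This is a sketch on four invented rows that reuse the column names from the cell above; it is not run against the real dataset.

```python
import pandas as pd

events = pd.DataFrame({
    "Year": [2014, 2014, 2015, 2013],
    "Month": [5, 7, 1, 12],
    "Day": [27, 3, 15, 30],
    "Country": ["India"] * 4,
})

# On or after 26 May 2014: a later year, a later month in 2014,
# or day 26 and later within May 2014
after_date = (
    (events["Year"] > 2014)
    | ((events["Year"] == 2014) & (events["Month"] > 5))
    | ((events["Year"] == 2014) & (events["Month"] == 5) & (events["Day"] >= 26))
)

selected = events[(events["Country"] == "India") & after_date]

# shape[0] and len() both give the number of rows
print(selected.shape[0], len(selected))
```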

In [149]:
import pandas as pd
import numpy as np

# Read the CSV file into a DataFrame
df = pd.read_csv('terrorismData.csv')

# Calculate the number of unique years in the dataset
year = len(set(df['Year']))

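# Filter the DataFrame to include only rows where 'Country' is 'India'.
# .copy() makes an independent frame, so adding a column below does not
# trigger a SettingWithCopyWarning
df = df[df.Country == 'India'].copy()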

# Create a new column 'Casualty' by summing the 'Killed' and 'Wounded' columns
df['Casualty'] = df['Killed'] + df['Wounded']

# Filter the DataFrame to include only rows where 'State' is 'Jammu and Kashmir'
jk = df[df.State == 'Jammu and Kashmir']

# Filter the DataFrame to include only rows where 'State' is one of the specified states
rc = df[df.State.isin(['Jharkhand', 'Odisha', 'Andhra Pradesh', 'Chhattisgarh'])]

# Calculate the total casualties in Jammu and Kashmir and specified states
jkc = int(np.sum(jk['Casualty']))
rcc = int(np.sum(rc['Casualty']))

# Print the casualties per year (integer division): specified states first, then Jammu and Kashmir
print(rcc // year, jkc // year)

115 261
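One detail of the casualty calculation is worth checking on a small example: adding two columns gives NaN wherever either value is missing, and such rows are then skipped by the sum. The sketch below uses invented rows and makes a different choice, treating a missing value as 0 with `fillna(0)`. Whether that is appropriate depends on the data, and it can change the totals compared with the cell above.

```python
import pandas as pd

events = pd.DataFrame({
    "Year": [2000, 2000, 2001, 2002],
    "State": ["Odisha", "Jammu and Kashmir", "Jharkhand", "Kerala"],
    "Killed": [2, 5, 1, 4],
    "Wounded": [3, None, 0, 1],   # one missing value
})

# Treat missing counts as 0 so the row still contributes its known part
events["Casualty"] = events["Killed"].fillna(0) + events["Wounded"].fillna(0)

# Number of distinct years in the data
n_years = events["Year"].nunique()

selected_states = events[events["State"].isin(
    ["Jharkhand", "Odisha", "Andhra Pradesh", "Chhattisgarh"]
)]
total = int(selected_states["Casualty"].sum())

print(total, n_years, total // n_years)
```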


# Terror Deadliest Attack

In [152]:
import pandas as pd
df = pd.read_csv('terrorismData.csv')

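# Keep only the row(s) with the highest number of people killed
df = df[df.Killed == df.Killed.max()]
# Read the details of the first such attack
mx_killed = df.Killed.iloc[0]
country = df.Country.iloc[0]
group = df.Group.iloc[0]
print(int(mx_killed), country, group)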

1570 Iraq Islamic State of Iraq and the Levant (ISIL)
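The row with the largest value can also be located with `idxmax()`, which returns the label of the first maximum and skips NaN by default. This is a sketch on invented rows with placeholder country and group names, not a replacement for the result above.

```python
import pandas as pd

events = pd.DataFrame({
    "Killed": [3.0, 12.0, None, 7.0],
    "Country": ["A", "B", "C", "D"],       # placeholder names
    "Group": ["g1", "g2", "g3", "g4"],     # placeholder names
})

# Label of the row with the highest 'Killed' value
worst_label = events["Killed"].idxmax()
worst = events.loc[worst_label]

print(int(worst["Killed"]), worst["Country"], worst["Group"])
```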
