# **Pandas**

#### A library for data handling, manipulation, cleaning, and pre-processing.

In [1]:
import pandas as pd

In [7]:
# Create a Series from a list of values and give it custom index labels.
data = pd.Series([0.25, 1.5, 0.5, 2.5, 6.25], index = ['a', 'b', 'c', 'd', 'e'])
data.index   # the index labels 'a' to 'e'
data.values  # the underlying NumPy array (a cell displays only its last expression)

array([0.25, 1.5 , 0.5 , 2.5 , 6.25])

#### So a Series behaves like a dictionary (each label maps to a value) while storing its values in a NumPy array, which keeps NumPy's speed.

In [8]:
type(data)

pandas.core.series.Series

In [9]:
type(data.values)

numpy.ndarray

In [12]:
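# Create a Series from an existing dictionary; the keys become the index.
# (Avoid naming the variable `dict`, which would shadow the built-in type.)
values_by_label = {'A':4.5, 'B':5.1, 'C':8.6, 'D':7.4}
data = pd.Series(values_by_label)
data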

A    4.5
B    5.1
C    8.6
D    7.4
dtype: float64

The indexing, slicing, and masking techniques we learned in NumPy also work here; 
just use the keys of the dictionary or the index labels specified at creation.
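As a small sketch of what that looks like in practice, the example below rebuilds the labelled Series from earlier and selects from it in four ways. The variable names are illustrative only. Note that a label slice includes both endpoints, unlike a positional slice.

```python
import pandas as pd

data = pd.Series([0.25, 1.5, 0.5, 2.5, 6.25], index=['a', 'b', 'c', 'd', 'e'])

single = data['b']              # dictionary-style lookup by label
by_label = data['b':'d']        # label slice: includes BOTH endpoints (b, c, d)
by_position = data.iloc[1:3]    # positional slice: excludes the stop (b, c)
masked = data[data > 1.0]       # boolean mask, exactly as in NumPy

print(single)
print(by_label)
print(by_position)
print(masked)
```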

#### A Series holds 1-D data; for 2-D data we use a DataFrame.

In [15]:
import pandas as pd

data = {   # This is a dictionary.
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])

print(df)
print(df.T) # transpose the obtained data frame

      calories  duration
day1       420        50
day2       380        40
day3       390        45
          day1  day2  day3
calories   420   380   390
duration    50    40    45


We can build it this way, or we can pass the two columns as pandas Series or as plain dictionaries that share the same index labels.
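A minimal sketch of those two alternatives, reusing the same calories and duration values. The names `days`, `df_from_series`, and `df_from_dicts` are made up for this illustration.

```python
import pandas as pd

days = ["day1", "day2", "day3"]

# Option 1: each column is a Series carrying the shared index labels.
calories = pd.Series([420, 380, 390], index=days)
duration = pd.Series([50, 40, 45], index=days)
df_from_series = pd.DataFrame({"calories": calories, "duration": duration})

# Option 2: each column is a plain dictionary keyed by the same labels.
df_from_dicts = pd.DataFrame({
    "calories": {"day1": 420, "day2": 380, "day3": 390},
    "duration": {"day1": 50, "day2": 40, "day3": 45},
})

print(df_from_series)
print(df_from_dicts)
```

In both cases pandas aligns the columns on their labels, so no separate `index` argument is needed.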

In [16]:
df['CPM'] = df['calories']/df['duration']  # Add a new column: calories divided by duration.
df 

      calories  duration       CPM
day1       420        50  8.400000
day2       380        40  9.500000
day3       390        45  8.666667


## **Handling missing, bad data with pandas.**

![image.png](attachment:image.png)

A value that is not specified shows up as NaN (or None), and such missing or bad data has to be handled before analysis.

We can use -

df.fillna(0)    Returns a copy in which every NaN is replaced by a fixed value, here 0.

df.dropna()     Drops every row that contains a NaN (pass axis=1 to drop columns instead). This can discard a lot of data.
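Here is a short sketch of both methods on a small frame with gaps. The data, including the `pulse` column, is made up for illustration. Both methods return new objects and leave the original frame untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing entries.
df = pd.DataFrame({
    "calories": [420, np.nan, 390],
    "duration": [50, 40, np.nan],
    "pulse": [110, 117, 103],
})

filled = df.fillna(0)              # every NaN becomes 0
rows_dropped = df.dropna()         # keeps only rows with no NaN at all
cols_dropped = df.dropna(axis=1)   # keeps only columns with no NaN at all

print(filled)
print(rows_dropped)
print(cols_dropped)
```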

#### A common ambiguity when slicing a Series with user-defined integer indices: does the slice use the implicit positions (the default 0 to n-1 indexing) or the explicit labels (the user-defined integers)?

![image.png](attachment:image.png)

Use loc (always the explicit labels) and iloc (always the implicit positions) to remove the ambiguity.
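The difference is easiest to see on a Series whose integer labels do not match its positions. This is a minimal sketch with made-up values.

```python
import pandas as pd

# Explicit integer labels 1, 3, 5 sit at implicit positions 0, 1, 2.
s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

by_label = s.loc[1]           # label 1
by_position = s.iloc[1]       # position 1

label_slice = s.loc[1:3]      # labels 1 through 3, endpoint included
position_slice = s.iloc[1:3]  # positions 1 and 2, stop excluded

print(by_label, by_position)
print(label_slice)
print(position_slice)
```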

## **Using CSV files with Pandas**

In [17]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

In [None]:
df = pd.read_csv('E:/covid/covid_19_data.csv')
df.head(15)        # Displays the first/top 15 records of the dataframe in question.

In [None]:
df.drop(['Sno.', 'Last Update'], axis=1, inplace=True)     # axis=1 means drop columns, not rows.
# inplace=True applies the change to the original dataframe instead of returning a copy.

![image.png](attachment:image.png)

df.describe() shows basic summary statistics (count, mean, std, min, quartiles, max) for the numeric columns of the data loaded from the CSV file.
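Since the CSV file above lives on a local drive, here is a self-contained sketch of the same read, drop, and describe workflow using an in-memory CSV. The column names and numbers are invented for illustration and are not taken from the real dataset.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """SNo,Country,Confirmed,Deaths
1,A,10,1
2,B,20,2
3,C,30,3
"""

df = pd.read_csv(io.StringIO(csv_text))  # read_csv accepts a path or a file-like object
df = df.drop(columns=["SNo"])            # same effect as drop(["SNo"], axis=1)

summary = df.describe()                  # numeric columns only by default
print(df.head())
print(summary)
```

Reassigning the result of `drop` is an alternative to `inplace=True` and keeps each step explicit.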

#### scikit-learn SimpleImputer example

In [None]:
import numpy as np

# Importing the SimpleImputer class
from sklearn.impute import SimpleImputer

# Imputer that replaces missing values (np.nan)
# with the mean of their column
imputer = SimpleImputer(missing_values = np.nan,
                        strategy ='mean')

data = [[12, np.nan, 34], [10, 32, np.nan],
        [np.nan, 11, 20]]

print("Original Data : \n", data)
# Fitting the imputer to the data (this learns the column means)
imputer = imputer.fit(data)

# Imputing the data
data = imputer.transform(data)

print("Imputed Data : \n", data)

# Remember: the mean (or the median, with strategy='median') is computed per column of the matrix
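To check the per-column behaviour by hand, the same matrix can be filled with pandas alone. This is a sketch of an equivalent computation, not part of the scikit-learn API: `fillna` accepts a Series of per-column fill values, and `df.mean()` skips NaN by default.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[12, np.nan, 34],
                   [10, 32, np.nan],
                   [np.nan, 11, 20]])

col_means = df.mean()            # per-column means, ignoring NaN
imputed = df.fillna(col_means)   # each NaN gets the mean of its own column

print(col_means)
print(imputed)
```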