<H1><B>PANDAS</B></H1>
Welcome to the lesson on Pandas. Pandas is a package for data manipulation and analysis in Python. The name Pandas is derived from the econometrics term Panel Data. Pandas incorporates two additional data structures into Python, namely <b>Pandas Series</b> and <b>Pandas DataFrame</b>. These data structures allow us to work with labeled and relational data in an easy and intuitive manner. These lessons are intended as a basic overview of Pandas and introduce some of its most important features.

## **Why Use Pandas?**
The recent success of machine learning algorithms is partly due to the huge amounts of data that we have available to train our algorithms on. However, when it comes to data, quantity is not the only thing that matters; the quality of your data is just as important.

More often than not, large datasets will have missing values, outliers, incorrect values, and so on. Data with a lot of missing or bad values, for example, is not going to allow your machine learning algorithms to perform well.

This is where Pandas comes in. Pandas Series and DataFrames are designed for fast data analysis and manipulation, as well as being flexible and easy to use. Below are just a few features that make Pandas an excellent package for data analysis:

- Allows the use of labels for rows and columns
- Can calculate rolling statistics on time series data
- Easy handling of NaN values
- Is able to load data of different formats into DataFrames
- Can join and merge different datasets together
- It integrates with NumPy and Matplotlib

For these and other reasons, Pandas DataFrames have become one of the most commonly used Pandas objects for data analysis in Python.

## **PANDAS SERIES:**
A Pandas series is a one-dimensional array-like object that can hold many data types, such as numbers or strings. One of the main differences between Pandas Series and NumPy ndarrays is that you can assign an index label to each element in the Pandas Series. In other words, you can name the indices of your Pandas Series anything you want. Another big difference between Pandas Series and NumPy ndarrays is that Pandas Series can hold data of different data types.




In [1]:
# We first import pandas using Python's import statement

import pandas as pd

Let's create a pandas series.  For that we just write <B>pd.Series()</b>

```
item = pd.Series(data, index)    # general form: we store the Series object in a variable, here named item
```



When creating a Series, we can pass arguments for <u>data</u> and <u>index</u>.

In [2]:
item = pd.Series(data=[15, 5, 'No'], index=['chocolates', 'chips', 'milk'])
print(item)

chocolates    15
chips          5
milk          No
dtype: object


As we can see, the Series is displayed with the index labels in the first column and the data in the second column.

<br><br>
## **Attributes of Pandas Series:**

Let's look at some of the attributes of a Pandas Series that help us understand our data.

In [3]:
item.shape   # gives us the size of each dimension of the data

(3,)

In [4]:
item.ndim   # gives us the number of dimensions of the data

1

In [5]:
item.size   # gives us the total number of items in the Series

3

In [6]:
item.index   # gives us the index labels of the Series

Index(['chocolates', 'chips', 'milk'], dtype='object')

In [7]:
item.values    # gives us the data of the Series

array([15, 5, 'No'], dtype=object)

If you are dealing with a very large Pandas Series and you are not sure whether an index label exists, you can check by using the ```in``` operator.
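As a minimal sketch of the membership check described above, the snippet below rebuilds the lesson's `item` Series and tests a few labels. The variable names `has_chips`, `has_bread`, and `has_value_15` are illustrative and not part of the original lesson.

```python
import pandas as pd

# Same Series as in the lesson
item = pd.Series(data=[15, 5, 'No'], index=['chocolates', 'chips', 'milk'])

# `in` tests the index labels, not the stored values
has_chips = 'chips' in item       # label exists
has_bread = 'bread' in item       # label does not exist
has_value_15 = 15 in item         # 15 is a value, not a label

print(has_chips, has_bread, has_value_15)
```

Note that the check looks only at the index, so a value stored in the Series is not found this way.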

## **Accessing Elements in Pandas Series:**
Now let's look at how we can access or modify elements in a Pandas Series. One great advantage of Pandas Series is that it allows us to access data in many different ways. Elements can be accessed using index labels or numerical indices inside square brackets, [ ], similar to how we access elements in NumPy ndarrays. Since we can use numerical indices, we can use both positive and negative integers to access data from the beginning or from the end of the Series, respectively. 

In [8]:
# Access a single element by its index label
item['chocolates']

15

In [9]:
item[['chocolates', 'milk']]   # a list of index labels is passed

chocolates    15
milk          No
dtype: object
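The cells above show label-based access only. A small sketch of position-based access and of modifying an element could look like the following. It uses the `.loc` (label) and `.iloc` (integer position) accessors, which make the intent explicit. Working on a copy named `updated` is an assumption made here so that the lesson's `item` stays unchanged.

```python
import pandas as pd

item = pd.Series(data=[15, 5, 'No'], index=['chocolates', 'chips', 'milk'])

# Position-based access: 0 is the first element, -1 is the last
first = item.iloc[0]
last = item.iloc[-1]

# Label-based access with .loc
by_label = item.loc['chips']

# Modify an element on a copy so the original Series is untouched
updated = item.copy()
updated.loc['milk'] = 2

print(first, last, by_label)
print(updated)
```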

<br>

##  **Arithmetic operations on Pandas Series:**
Just like with NumPy ndarrays, we can perform element-wise arithmetic operations on Pandas Series.
We will look at arithmetic operations between a Pandas Series and single numbers.<br>
Let's first make a new Series:

In [10]:
sweets = pd.Series(data=[10, 5, 7], index=['candies', 'donuts', 'ladoos'])
sweets

candies    10
donuts      5
ladoos      7
dtype: int64

We can now modify the data in sweets by performing basic arithmetic operations. Let's see some examples

In [11]:
sweets+2

candies    12
donuts      7
ladoos      9
dtype: int64

In [12]:
sweets-2

candies    8
donuts     3
ladoos     5
dtype: int64

In [13]:
sweets*2

candies    20
donuts     10
ladoos     14
dtype: int64

In [14]:
sweets/2

candies    5.0
donuts     2.5
ladoos     3.5
dtype: float64

In [15]:
# WE CAN ALSO APPLY MATHEMATICAL FUNCTIONS FROM NUMPY SUCH AS SQUARE ROOT
import numpy as np

np.sqrt(sweets)

candies    3.162278
donuts     2.236068
ladoos     2.645751
dtype: float64

Other NumPy functions, such as power and exponential, are applied element-wise to the Series in the same way. Let's see some examples

In [16]:
np.power(sweets, 4)

candies    10000
donuts       625
ladoos      2401
dtype: int64

In [17]:
np.exp(sweets)

candies    22026.465795
donuts       148.413159
ladoos      1096.633158
dtype: float64

We can also apply arithmetic operations to specific elements of a Series.
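One way to do this, sketched below with the lesson's `sweets` Series, is to select the elements first and then apply the operation. The result names (`one_plus_two`, `last_doubled`, `subset_halved`) are illustrative only.

```python
import pandas as pd

sweets = pd.Series(data=[10, 5, 7], index=['candies', 'donuts', 'ladoos'])

# Operate on a single element selected by label
one_plus_two = sweets['candies'] + 2

# Operate on a single element selected by position
last_doubled = sweets.iloc[-1] * 2

# Operate on several elements selected by a list of labels
subset_halved = sweets.loc[['candies', 'donuts']] / 2

print(one_plus_two)
print(last_doubled)
print(subset_halved)
```

None of these expressions changes `sweets` itself; each returns a new value or a new Series.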

<br><br><hr>
<H3><B>PANDAS DATAFRAME</B></H3>

A DataFrame is a two-dimensional object with labeled rows and columns that can hold values of different data types.

* We can create a dataframe manually or by loading data from a file.

<br>
Let's first create a dataframe manually:<br>

*  First let's create a dictionary of pandas series and pass it into pandas dataframe

In [19]:
item = {'Column_1': pd.Series([250, 15, 70, 100], index=['watch', 'toys', 'glasses', 'shirt']),
        'Column_2': pd.Series([120, 50, 90], index=['pants', 'books', 'toys' ]), 
        'Column_3': pd.Series([120, 50, 90], index=['shirt', 'books', 'toys' ])}

# item is a dictionary of three Series, each mapping item names to their cost

In [21]:
# WE CAN CREATE A DATAFRAME BY PASSING THE DICTIONARY TO THE DataFrame FUNCTION

cart = pd.DataFrame(item)
cart

Unnamed: 0,Column_1,Column_2,Column_3
books,,50.0,50.0
glasses,70.0,,
pants,,120.0,
shirt,100.0,,120.0
toys,15.0,90.0,90.0
watch,250.0,,


* Make sure to capitalize the **D** and **F** while calling the dataframe function.
* The dataframe is displayed in the tabular form
* The row labels for the dataframe are built from the union of the index labels we provided in the series, and the column labels for the dataframe are taken from the keys of the dictionary.
* The dataframe has NaN values wherever a column's series has no entry for a row label. For example, Column_1 has no books or pants, and Column_2 has no glasses, shirt or watch.

In [22]:
# IN THE ABOVE EXAMPLE WE PROVIDED SERIES THAT CLEARLY DEFINED THE INDEX LABELS. HOWEVER, IF WE DON'T PROVIDE THE INDEX LABELS,
# THEN THE DATAFRAME WILL USE NUMERICAL INDEX VALUES
# LET'S CREATE A SIMILAR DICTIONARY WITHOUT THE INDEX LABELS
new_item = {'Column_A': pd.Series([250, 15, 70, 100]),
        'Column_B': pd.Series([120, 50, 90])}


# NOW MAKE THE DATAFRAME USING THE NEW DICTIONARIES 
df = pd.DataFrame(new_item)
df

Unnamed: 0,Column_A,Column_B
0,250,120.0
1,15,50.0
2,70,90.0
3,100,


The dataframe uses the numerical indices.

<br>
<hr>
<h4><b>2.a Attributes</b></h4>

Like we did in pandas series, we can also extract information from pandas dataframe using some attributes.

In [23]:
cart.index

Index(['books', 'glasses', 'pants', 'shirt', 'toys', 'watch'], dtype='object')

In [24]:
cart.columns

Index(['Column_1', 'Column_2', 'Column_3'], dtype='object')

In [25]:
cart.values

array([[ nan,  50.,  50.],
       [ 70.,  nan,  nan],
       [ nan, 120.,  nan],
       [100.,  nan, 120.],
       [ 15.,  90.,  90.],
       [250.,  nan,  nan]])

In [26]:
cart.shape

(6, 3)

In [27]:
cart.ndim

2

In [28]:
cart.size

18

><b>NOTE:</B> While creating the cart dataframe, we passed the whole dictionary to the dataframe function. However, there might be cases when we are only interested in a specific subset of the data. Pandas lets us select which data we want to put into the DataFrame with the keywords **columns** and **index**.

In [33]:
my_cart = pd.DataFrame(item, columns=['Column_1'])
my_cart

Unnamed: 0,Column_1
watch,250
toys,15
glasses,70
shirt,100


In [34]:
selected_item = pd.DataFrame(item, index=['pants', 'toys'])
selected_item

Unnamed: 0,Column_1,Column_2,Column_3
pants,,120,
toys,15.0,90,90.0


We can also create a dataframe from a dictionary of lists or arrays. The procedure is the same as before: we start by creating the dictionary and then pass it into the dataframe function. In this case, however, all the lists or arrays in the dictionary must be of the same length.

In [35]:
# Here's the dictionary of the integers and the floats.
data = {'Integers':[1,2,3],
         'Floats':[1.1, 2.2, 3.3]}

df = pd.DataFrame(data, index=['label1', 'label2', 'label3'])    ## IF WE DON'T PASS THE INDICES, DATAFRAME WILL AUTOMATICALLY USE NUMERICAL INDICES
df

Unnamed: 0,Integers,Floats
label1,1,1.1
label2,2,2.2
label3,3,3.3
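To illustrate the same-length requirement for a dictionary of lists, here is a small sketch. The names `uneven`, `error_message`, and `padded` are hypothetical. The workaround of wrapping each list in a `pd.Series` is one option among several; it follows the dictionary-of-Series approach used earlier in this lesson.

```python
import pandas as pd

# Lists of different lengths: 3 integers but only 2 floats
uneven = {'Integers': [1, 2, 3],
          'Floats': [1.1, 2.2]}

# Passing the lists directly raises a ValueError
try:
    pd.DataFrame(uneven)
    error_message = None
except ValueError as err:
    error_message = str(err)

print('Error:', error_message)

# Wrapping each list in a Series lets Pandas align on the index
# and fill the missing position with NaN instead
padded = pd.DataFrame({key: pd.Series(values) for key, values in uneven.items()})
print(padded)
```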


In [36]:
# Creating DataFrame using a list of python dictionaries 
ListOfDict = [{'apple':20, 'banana':15, 'orange':30},{'apple':10, 'tomato':17, 'grapes': 35}]

df = pd.DataFrame(ListOfDict)
df

Unnamed: 0,apple,banana,orange,tomato,grapes
0,20,15.0,30.0,,
1,10,,,17.0,35.0


DataFrame used the numerical row indices.
If we want to assign the row indices some values, we can use:


In [37]:
df.index = ['personA', 'personB']
df  

Unnamed: 0,apple,banana,orange,tomato,grapes
personA,20,15.0,30.0,,
personB,10,,,17.0,35.0


**Accessing the values in a DataFrame**

In [38]:
df[['apple']]     ##accessing a column

Unnamed: 0,apple
personA,20
personB,10


In [39]:
df[['banana','tomato']]     ##accessing by passing a list of columns

Unnamed: 0,banana,tomato
personA,15.0,
personB,,17.0


In [40]:
df.loc[['personA','personB']]    ## accessing rows by label using loc

Unnamed: 0,apple,banana,orange,tomato,grapes
personA,20,15.0,30.0,,
personB,10,,,17.0,35.0


In [41]:
df.iloc[[1,0]]    ## accessing rows using iloc -> integer location

Unnamed: 0,apple,banana,orange,tomato,grapes
personB,10,,,17.0,35.0
personA,20,15.0,30.0,,


In [42]:
df['grapes']['personB']    ##accessing a specific value

35.0

><b>NOTE:</B> When accessing a specific element with chained brackets, as in df[column][row], the column label always comes first, followed by the row label.
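For comparison, the `.loc` and `.iloc` accessors take the row first and the column second. The sketch below rebuilds the lesson's `df` and reads the same value three ways; the names `chained`, `via_loc`, and `via_iloc` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [{'apple': 20, 'banana': 15, 'orange': 30},
     {'apple': 10, 'tomato': 17, 'grapes': 35}],
    index=['personA', 'personB'])

# Chained brackets: column label first, then row label
chained = df['grapes']['personB']

# .loc: row label first, then column label
via_loc = df.loc['personB', 'grapes']

# .iloc: row position first, then column position
# ('grapes' is the fifth column, so its position is 4)
via_iloc = df.iloc[1, 4]

print(chained, via_loc, via_iloc)
```

A single `.loc` call is generally the safer choice when assigning a value, since chained brackets may end up operating on a temporary copy.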

In [43]:
## IF WE WANT TO ADD A NEW COLUMN TO OUR DATAFRAME,  WE CAN ADD LIKE THIS:

df['corn'] = [5, 7]
df

Unnamed: 0,apple,banana,orange,tomato,grapes,corn
personA,20,15.0,30.0,,,5
personB,10,,,17.0,35.0,7


We can also add new columns by using arithmetic operations on the other columns of our DataFrame.<br>
For example, we can add a new column vegies by adding the values for corn and tomato.
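A minimal sketch of that idea follows. It works on a copy named `demo` (a hypothetical name) so that the lesson's `df` is not changed and the later outputs still match.

```python
import pandas as pd

df = pd.DataFrame(
    [{'apple': 20, 'banana': 15, 'orange': 30},
     {'apple': 10, 'tomato': 17, 'grapes': 35}],
    index=['personA', 'personB'])
df['corn'] = [5, 7]

# Work on a copy so the original DataFrame keeps its columns
demo = df.copy()

# Element-wise sum of two existing columns becomes a new column
demo['vegies'] = demo['corn'] + demo['tomato']
print(demo[['corn', 'tomato', 'vegies']])
```

Because personA has no tomato value, the sum for that row is NaN: arithmetic with a missing value yields a missing value.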


In [44]:
new_person = [{'apple':15, 'banana': 17, 'corn': 3, 'orange':5}]
new_df = pd.DataFrame(new_person, index=['personC'])
new_df

Unnamed: 0,apple,banana,corn,orange
personC,15,17,3,5


In [45]:
## WE CAN NOW ADD THE NEW ROW TO THE ORIGINAL DATAFRAME
## (DataFrame.append WAS REMOVED IN PANDAS 2.0, SO WE USE pd.concat)

df = pd.concat([df, new_df])
df

Unnamed: 0,apple,banana,orange,tomato,grapes,corn
personA,20,15.0,30.0,,,5
personB,10,,,17.0,35.0,7
personC,15,17.0,5.0,,,3


## **Dealing with NaN values:**
As mentioned earlier, before we can begin training our learning algorithms with large datasets, we usually need to clean the data first. This means we need to have a method for detecting and correcting errors in our data. 

While any given dataset can have many types of bad data, such as outliers or incorrect values, the type we encounter most often is missing values. As we saw earlier, Pandas assigns ```NaN``` values to missing data. In this lesson we will learn how to detect and deal with ```NaN``` values.

We will begin by creating a DataFrame with some ```NaN``` values in it.




In [46]:
import pandas as pd

# We create a list of Python dictionaries
items2 = [{'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes':8, 'suits':45},
{'watches': 10, 'glasses': 50, 'bikes': 15, 'pants':5, 'shirts': 2, 'shoes':5, 'suits':7},
{'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes':10}]

# We create a DataFrame  and provide the row index
store_items = pd.DataFrame(items2, index = ['store 1', 'store 2', 'store 3'])

# We display the DataFrame
store_items

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
store 1,20,30,35,15.0,8,45.0,
store 2,15,5,10,2.0,5,7.0,50.0
store 3,20,30,35,,10,,4.0


In cases where we load very large datasets into a DataFrame, possibly with millions of items, the number of NaN values is not easily visualized. For these cases, we can use a combination of methods to count the number of ```NaN``` values in our data. The following example combines the ```.isnull()``` and the ```sum()``` methods to count the number of ```NaN``` values in our DataFrame

In [47]:
store_items.isnull()

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
store 1,False,False,False,False,False,False,True
store 2,False,False,False,False,False,False,False
store 3,False,False,False,True,False,True,False


> In Pandas, logical True values have numerical value 1 and logical False values have numerical value 0. Therefore, we can count the number of NaN values by counting the number of logical True values.

In [48]:
# We count the number of NaN values in the columns of store_items
x =  store_items.isnull().sum()

# We print x
print('Number of NaN values in our DataFrame:', x)

Number of NaN values in our DataFrame: bikes      0
pants      0
watches    0
shirts     1
shoes      0
suits      1
glasses    1
dtype: int64
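The output above gives one count per column. If a single total is wanted, one option is to sum the per-column counts once more, as sketched below. The names `per_column` and `total_nan` are illustrative.

```python
import pandas as pd

items2 = [{'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes': 8, 'suits': 45},
          {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5, 'shirts': 2, 'shoes': 5, 'suits': 7},
          {'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes': 10}]
store_items = pd.DataFrame(items2, index=['store 1', 'store 2', 'store 3'])

# First sum: NaN count per column (a Series)
per_column = store_items.isnull().sum()

# Second sum: add the column counts together into one number
total_nan = int(per_column.sum())

print('Total number of NaN values:', total_nan)
```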


Instead of counting the number of NaN values, we can also do the opposite and count the number of non-NaN values. We can do this by using the .count() method as shown below:

In [49]:
# We print the number of non-NaN values in our DataFrame
print()
print('Number of non-NaN values in the columns of our DataFrame:\n', 
        store_items.count())


Number of non-NaN values in the columns of our DataFrame:
 bikes      3
pants      3
watches    3
shirts     2
shoes      3
suits      2
glasses    2
dtype: int64


Now that we have learned how to check whether our dataset has any NaN values in it, the next step is to decide what to do with them. In general we have two options: we can either delete or replace the NaN values. In the following examples we will show you how to do both.

We will start by learning how to eliminate rows or columns from our DataFrame that contain any NaN values. The .dropna(axis) method eliminates any rows with NaN values when axis = 0 is used and will eliminate any columns with NaN values when axis = 1 is used. Let's see some examples

In [50]:
# We drop any rows with NaN values
store_items.dropna(axis = 0)

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
store 2,15,5,10,2.0,5,7.0,50.0


In [51]:
# We drop any columns with NaN values
store_items.dropna(axis = 1)

Unnamed: 0,bikes,pants,watches,shoes
store 1,20,30,35,8
store 2,15,5,10,5
store 3,20,30,35,10


Notice that the .dropna() method eliminates (drops) the rows or columns with NaN values out of place. This means that the original DataFrame is not modified. You can always remove the desired rows or columns in place by setting the keyword inplace = True inside the dropna() function.
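The difference between the two behaviours can be sketched as follows. The names `cleaned` and `in_place` are hypothetical, and the in-place call is made on a copy so that the lesson's `store_items` keeps all its rows and columns.

```python
import pandas as pd

items2 = [{'bikes': 20, 'pants': 30, 'watches': 35, 'shirts': 15, 'shoes': 8, 'suits': 45},
          {'watches': 10, 'glasses': 50, 'bikes': 15, 'pants': 5, 'shirts': 2, 'shoes': 5, 'suits': 7},
          {'bikes': 20, 'pants': 30, 'watches': 35, 'glasses': 4, 'shoes': 10}]
store_items = pd.DataFrame(items2, index=['store 1', 'store 2', 'store 3'])

# Out of place: dropna returns a new DataFrame, the original is unchanged
cleaned = store_items.dropna(axis=0)
print(store_items.shape, cleaned.shape)

# In place: the DataFrame itself is modified and nothing is returned
in_place = store_items.copy()
in_place.dropna(axis=1, inplace=True)
print(list(in_place.columns))
```

Assigning the result to a new name, as with `cleaned`, is often easier to follow than `inplace=True`, since the original data stays available.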

Now, instead of eliminating NaN values, we can replace them with suitable values. We could choose for example to replace all NaN values with the value 0. We can do this by using the .fillna() method as shown below.

In [52]:
# We replace all NaN values with 0
store_items.fillna(0)

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
store 1,20,30,35,15.0,8,45.0,0.0
store 2,15,5,10,2.0,5,7.0,50.0
store 3,20,30,35,0.0,10,0.0,4.0


In machine learning you will most likely use datasets from many sources to train your learning algorithms. Pandas allows us to load data of different formats into DataFrames. One of the most popular formats used to store data is CSV. CSV stands for Comma Separated Values and offers a simple format to store data. We can load CSV files into Pandas DataFrames using the ```pd.read_csv()``` function. Let's load Google stock data into a Pandas DataFrame. The GOOG.csv file contains Google stock data from 8/19/2004 till 10/13/2017 taken from Yahoo Finance.

In [64]:
import pandas as pd

# We load Google stock data in a DataFrame
Google_stock = pd.read_csv('https://raw.githubusercontent.com/thecodescholar/DA_Python_Jun_21/main/Dataset/GOOG.csv')

# We print some information about Google_stock
print('Google_stock is of type:', type(Google_stock))
print('Google_stock has shape:', Google_stock.shape)

Google_stock is of type: <class 'pandas.core.frame.DataFrame'>
Google_stock has shape: (3313, 7)


We see that we have loaded the GOOG.csv file into a Pandas DataFrame and it consists of 3,313 rows and 7 columns. Now let's look at the stock data

In [65]:
Google_stock

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2004-08-19,49.676899,51.693783,47.669952,49.845802,49.845802,44994500
1,2004-08-20,50.178635,54.187561,49.925285,53.805050,53.805050,23005800
2,2004-08-23,55.017166,56.373344,54.172661,54.346527,54.346527,18393200
3,2004-08-24,55.260582,55.439419,51.450363,52.096165,52.096165,15361800
4,2004-08-25,52.140873,53.651051,51.604362,52.657513,52.657513,9257400
...,...,...,...,...,...,...,...
3308,2017-10-09,980.000000,985.424988,976.109985,977.000000,977.000000,891400
3309,2017-10-10,980.000000,981.570007,966.080017,972.599976,972.599976,968400
3310,2017-10-11,973.719971,990.710022,972.250000,989.250000,989.250000,1693300
3311,2017-10-12,987.450012,994.119995,985.000000,987.830017,987.830017,1262400


We see that it is quite a large dataset and that Pandas has automatically assigned numerical row indices to the DataFrame. Pandas also used the labels that appear in the data in the CSV file to assign the column labels.

When dealing with large datasets like this one, it is often useful to look at just the first few rows of data instead of the whole dataset. The ```.head()``` method returns the first 5 rows by default; here we pass 10 to see the first 10 rows, as shown below

In [66]:
Google_stock.head(10)

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2004-08-19,49.676899,51.693783,47.669952,49.845802,49.845802,44994500
1,2004-08-20,50.178635,54.187561,49.925285,53.80505,53.80505,23005800
2,2004-08-23,55.017166,56.373344,54.172661,54.346527,54.346527,18393200
3,2004-08-24,55.260582,55.439419,51.450363,52.096165,52.096165,15361800
4,2004-08-25,52.140873,53.651051,51.604362,52.657513,52.657513,9257400
5,2004-08-26,52.135906,53.626213,51.991844,53.606342,53.606342,7148200
6,2004-08-27,53.700729,53.959049,52.503513,52.732029,52.732029,6258300
7,2004-08-30,52.299839,52.40416,50.675404,50.675404,50.675404,5235700
8,2004-08-31,50.819469,51.519913,50.74992,50.85424,50.85424,4954800
9,2004-09-01,51.018177,51.152302,49.512966,49.80109,49.80109,9206800


We can also take a look at the last rows of data by using the ```.tail()``` method, which returns the last 5 rows by default; here we pass 7 to see the last 7 rows:

In [67]:
Google_stock.tail(7)

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
3306,2017-10-05,955.48999,970.909973,955.179993,969.960022,969.960022,1213800
3307,2017-10-06,966.700012,979.460022,963.359985,978.890015,978.890015,1173900
3308,2017-10-09,980.0,985.424988,976.109985,977.0,977.0,891400
3309,2017-10-10,980.0,981.570007,966.080017,972.599976,972.599976,968400
3310,2017-10-11,973.719971,990.710022,972.25,989.25,989.25,1693300
3311,2017-10-12,987.450012,994.119995,985.0,987.830017,987.830017,1262400
3312,2017-10-13,992.0,997.210022,989.0,989.679993,989.679993,1157700


We can also optionally use ```.head(N)``` or ```.tail(N)``` to display the first and last N rows of data, respectively.

Let's do a quick check to see whether we have any ```NaN``` values in our dataset. To do this, we will use the ```.isnull()``` method followed by the ```.any()``` method to check whether any of the columns contain ```NaN``` values.

In [68]:
Google_stock.isnull().any()

Date         False
Open         False
High         False
Low          False
Close        False
Adj Close    False
Volume       False
dtype: bool

### Great Job

## All the Best

# THE CODE SCHOLAR