# Introduction
pandas is a very popular Python library, used mostly for working with large, structured datasets. It is especially helpful for relational or labeled data, which it makes easy and intuitive to handle. pandas is commonly used together with NumPy, since many NumPy functions can be applied directly to pandas objects.

## Installation
The installation process is the same as for NumPy. If you have Anaconda installed, NumPy and pandas are most likely already installed as well. To install them or update to the latest version, run:


In [None]:
conda install numpy
conda install pandas

Alternatively, you can install them with pip:

In [None]:
pip install numpy
pip install pandas

Once the libraries are installed, import them with the following lines. The aliases np and pd are conventional and keep later function calls short.

In [4]:
import pandas as pd
import numpy as np

## Pandas - Series and Dataframes

A pandas Series is very similar to a one-dimensional NumPy ndarray, but it is more practical for labeled data because its values can be indexed by custom labels. A NumPy array is indexed by position, starting at [0, 1, 2, ...]. A Series uses the same index by default, but the index can be set to custom labels such as "age" or "height".

In [15]:
exampleArray = np.array([21, 32, 43])
print(exampleArray)

[21 32 43]


In [12]:
hobbies = np.array(["soccer","basketball","baseball"])
series1 = pd.Series(hobbies)
print(series1)

0        soccer
1    basketball
2      baseball
dtype: object


In [17]:
hobbies = np.array(["soccer","basketball","baseball"])
series1 = pd.Series(hobbies,index=['Emma', 'Swetha', 'Serajh'])
print(series1)

Emma          soccer
Swetha    basketball
Serajh      baseball
dtype: object
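
Custom labels are useful mainly because values can then be looked up by label. The sketch below reuses the series1 example from above; the variable names by_label, by_position, and labels are only illustrative.

```python
import numpy as np
import pandas as pd

hobbies = np.array(["soccer", "basketball", "baseball"])
series1 = pd.Series(hobbies, index=["Emma", "Swetha", "Serajh"])

# Look up a value by its custom label
by_label = series1["Swetha"]

# Positional access is still available through .iloc
by_position = series1.iloc[0]

# The labels themselves are stored in the Series index
labels = list(series1.index)

print(by_label, by_position, labels)
```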


A pandas DataFrame builds on Series: it holds multiple columns, and each column acts as a Series. Each column has a single data type, although different columns may have different types, and all columns must have the same number of elements (rows). A DataFrame can be thought of as a table, or as a matrix with labeled rows and columns.

There are many ways to fill a DataFrame. For this example, we'll use a list of lists where each nested list is one row of data. The columns keyword argument sets custom column names.

In [22]:
dataf = pd.DataFrame([
    ['Rohan','Handsome',18],
    ['Daniel', 'Strong',19],
    ['Aaron', 'Smart',18]
    ],
    columns=['name','feature','age'])

print(dataf)

     name   feature  age
0   Rohan  Handsome   18
1  Daniel    Strong   19
2   Aaron     Smart   18


## Working with Datasets
To begin working with data, we must first load a structured file such as a CSV file. CSV stands for "Comma-Separated Values", a plain-text format for exchanging structured information, like the contents of a spreadsheet, among programs that can't necessarily talk to one another directly.

To read a CSV file, we call pd.read_csv() and assign the result to the variable "dataset". Here the file is "Fish.csv". Because only the file name is given, the file must be in the notebook's working directory; otherwise, pass the full path.

In [14]:
dataset = pd.read_csv("Fish.csv")
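
If Fish.csv is not at hand, the same call can be tried on an in-memory buffer. The sketch below is only an illustration: the three rows are copied from the Fish data used in this tutorial, and the names csv_text and sample are hypothetical.

```python
import io

import pandas as pd

# A few rows in CSV form; io.StringIO stands in for a file on disk
csv_text = """Species,Weight,Length1,Length2,Length3,Height,Width
Bream,242.0,23.2,25.4,30.0,11.52,4.02
Bream,290.0,24.0,26.3,31.2,12.48,4.3056
Smelt,12.2,11.5,12.2,13.4,2.0904,1.3936
"""

# read_csv accepts a path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)
print(sample.columns.tolist())
```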

In [24]:
dataset

     Species  Weight  Length1  Length2  Length3   Height   Width
0      Bream   242.0     23.2     25.4     30.0  11.5200  4.0200
1      Bream   290.0     24.0     26.3     31.2  12.4800  4.3056
2      Bream   340.0     23.9     26.5     31.1  12.3778  4.6961
3      Bream   363.0     26.3     29.0     33.5  12.7300  4.4555
4      Bream   430.0     26.5     29.0     34.0  12.4440  5.1340
...      ...     ...      ...      ...      ...      ...     ...
154    Smelt    12.2     11.5     12.2     13.4   2.0904  1.3936
155    Smelt    13.4     11.7     12.4     13.5   2.4300  1.2690
156    Smelt    12.2     12.1     13.0     13.8   2.2770  1.2558
157    Smelt    19.7     13.2     14.3     15.2   2.8728  2.0672



In [12]:
dataset.head(10)

   Species  Weight  Length1  Length2  Length3   Height   Width
0    Bream   242.0     23.2     25.4     30.0  11.5200  4.0200
1    Bream   290.0     24.0     26.3     31.2  12.4800  4.3056
2    Bream   340.0     23.9     26.5     31.1  12.3778  4.6961
3    Bream   363.0     26.3     29.0     33.5  12.7300  4.4555
4    Bream   430.0     26.5     29.0     34.0  12.4440  5.1340
5    Bream   450.0     26.8     29.7     34.7  13.6024  4.9274
6    Bream   500.0     26.8     29.7     34.5  14.1795  5.2785
7    Bream   390.0     27.6     30.0     35.0  12.6700  4.6900
8    Bream   450.0     27.6     30.0     35.1  14.0049  4.8438
9    Bream   500.0     28.5     30.7     36.2  14.2266  4.9594


In [25]:
dataset.tail(10)

     Species  Weight  Length1  Length2  Length3  Height   Width
149    Smelt     9.8     10.7     11.2     12.4  2.0832  1.2772
150    Smelt     8.7     10.8     11.3     12.6  1.9782  1.2852
151    Smelt    10.0     11.3     11.8     13.1  2.2139  1.2838
152    Smelt     9.9     11.3     11.8     13.1  2.2139  1.1659
153    Smelt     9.8     11.4     12.0     13.2  2.2044  1.1484
154    Smelt    12.2     11.5     12.2     13.4  2.0904  1.3936
155    Smelt    13.4     11.7     12.4     13.5  2.4300  1.2690
156    Smelt    12.2     12.1     13.0     13.8  2.2770  1.2558
157    Smelt    19.7     13.2     14.3     15.2  2.8728  2.0672
158    Smelt    19.9     13.8     15.0     16.2  2.9322  1.8792


In [40]:
dataset.index

RangeIndex(start=0, stop=159, step=1)

In [41]:
dataset.columns

Index(['Species', 'Weight', 'Length1', 'Length2', 'Length3', 'Height',
       'Width'],
      dtype='object')

#### Pandas has four accessors for selecting data:

 - .loc[] accepts the labels of rows and columns and returns Series or DataFrames. You can use it to get entire rows or columns, as well as their parts.

 - .iloc[] accepts the zero-based indices of rows and columns and returns Series or DataFrames. You can use it to get entire rows or columns, or their parts.

 - .at[] accepts the labels of rows and columns and returns a single data value.

 - .iat[] accepts the zero-based indices of rows and columns and returns a single data value.

In [44]:
dataset.loc[0, "Weight"]

242.0

In [48]:
dataset.loc[:, "Length1"]

0      23.2
1      24.0
2      23.9
3      26.3
4      26.5
       ... 
154    11.5
155    11.7
156    12.1
157    13.2
158    13.8
Name: Length1, Length: 159, dtype: float64

In [53]:
# Row 9, column position 3, which is the "Length2" column
dataset.iloc[9, 3]

30.7

In [54]:
#Looking at just the first row
dataset.iloc[0, :]

Species    Bream
Weight     242.0
Length1     23.2
Length2     25.4
Length3     30.0
Height     11.52
Width       4.02
Name: 0, dtype: object
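
The examples above use .loc[] and .iloc[]. The sketch below adds .at[] and .iat[] and shows one difference between label and position slices. It is an illustration only: fish is a hypothetical three-row frame with values copied from the Fish data, and the other variable names are made up for this example.

```python
import pandas as pd

fish = pd.DataFrame({
    "Species": ["Bream", "Bream", "Bream"],
    "Weight": [242.0, 290.0, 340.0],
    "Length1": [23.2, 24.0, 23.9],
})

# .at[] takes a row label and a column label and returns one value
weight_by_label = fish.at[1, "Weight"]

# .iat[] takes zero-based row and column positions and returns one value
weight_by_position = fish.iat[1, 1]

# Label slices with .loc[] include the end label ...
label_slice = fish.loc[0:1, ["Species", "Weight"]]

# ... while position slices with .iloc[] exclude the end position
position_slice = fish.iloc[0:1, 0:2]

print(weight_by_label, weight_by_position)
print(len(label_slice), len(position_slice))
```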

## Sums, Differences, and Averages
When working with datasets, it helps to start with basic operations such as adding, subtracting, and averaging. DataFrame.sum() returns the sum of a pandas DataFrame along either axis. By default, axis=0, so sum() adds down each column and returns one total per column; pass axis=1 to get one total per row. Keep in mind that row sums only make sense when every element in the row is numeric, so non-numeric columns such as "Species" should be excluded first. Summing a string column concatenates its values, as the "Species" entry in the output shows.

In [70]:
sumCol = dataset.sum()
sumCol

Species    BreamBreamBreamBreamBreamBreamBreamBreamBreamB...
Weight                                               63333.9
Length1                                               4173.3
Length2                                               4518.1
Length3                                               4965.1
Height                                              1426.388
Width                                               702.3802
dtype: object

In [78]:
# Using the iloc accessor to sum only the numeric columns
sumRow = dataset.iloc[:,1:].sum(axis=1)
sumRow

0      336.1400
1      388.2856
2      438.5739
3      468.9855
4      537.0780
         ...   
154     52.7840
155     54.6990
156     54.6328
157     67.3400
158     69.7114
Length: 159, dtype: float64
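
Selecting numeric columns by position with .iloc[] works when the column order is known. As a hedged alternative, the sketch below picks the numeric columns by data type instead, using the numeric_only argument of sum() and select_dtypes(). The frame fish is a hypothetical three-row sample with values copied from the Fish data.

```python
import pandas as pd

fish = pd.DataFrame({
    "Species": ["Bream", "Bream", "Smelt"],
    "Weight": [242.0, 290.0, 12.2],
    "Length1": [23.2, 24.0, 11.5],
})

# Column totals (axis=0 is the default); string columns are skipped
col_totals = fish.sum(numeric_only=True)

# Row totals (axis=1) over the numeric columns only
row_totals = fish.select_dtypes(include="number").sum(axis=1)

print(col_totals)
print(row_totals)
```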

Taking the average is very similar: DataFrame.mean() returns the mean value of each column. Changing the axis parameter to axis=1 returns the mean of each row instead.

In [86]:
# Using the iloc accessor to take the mean of only the numeric columns
average = dataset.iloc[:,1:].mean()
average

Weight      398.326415
Length1      26.247170
Length2      28.415723
Length3      31.227044
Height        8.970994
Width         4.417486
dtype: float64
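
To make the effect of the axis parameter concrete, the sketch below computes both column means and row means on a hypothetical two-row sample (values copied from the first two rows of the Fish data). The names lengths, col_means, and row_means are illustrative.

```python
import pandas as pd

lengths = pd.DataFrame({
    "Length1": [23.2, 24.0],
    "Length2": [25.4, 26.3],
    "Length3": [30.0, 31.2],
})

# axis=0 (default): one mean per column
col_means = lengths.mean()

# axis=1: one mean per row
row_means = lengths.mean(axis=1)

print(col_means)
print(row_means)
```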

When subtracting columns, the simplest method is to select the individual columns and put a subtraction sign between them.

In [92]:
# Creating a new column "len_diff" by subtracting "Length2" from "Length1"
dataset["len_diff"] = dataset["Length1"] - dataset["Length2"]
dataset

     Species  Weight  Length1  Length2  Length3   Height   Width  len_diff
0      Bream   242.0     23.2     25.4     30.0  11.5200  4.0200      -2.2
1      Bream   290.0     24.0     26.3     31.2  12.4800  4.3056      -2.3
2      Bream   340.0     23.9     26.5     31.1  12.3778  4.6961      -2.6
3      Bream   363.0     26.3     29.0     33.5  12.7300  4.4555      -2.7
4      Bream   430.0     26.5     29.0     34.0  12.4440  5.1340      -2.5
...      ...     ...      ...      ...      ...      ...     ...       ...
154    Smelt    12.2     11.5     12.2     13.4   2.0904  1.3936      -0.7
155    Smelt    13.4     11.7     12.4     13.5   2.4300  1.2690      -0.7
156    Smelt    12.2     12.1     13.0     13.8   2.2770  1.2558      -0.9
157    Smelt    19.7     13.2     14.3     15.2   2.8728  2.0672      -1.1
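
The subtraction operator works element by element and matches rows by index label. The sketch below, on a hypothetical two-row sample copied from the Fish data, shows that the Series.sub() method gives the same result as the operator; the column name len_diff_sub is made up for this comparison.

```python
import pandas as pd

fish = pd.DataFrame({
    "Length1": [23.2, 24.0],
    "Length2": [25.4, 26.3],
})

# Operator form: Length1 minus Length2, row by row
fish["len_diff"] = fish["Length1"] - fish["Length2"]

# Method form: equivalent to the operator
fish["len_diff_sub"] = fish["Length1"].sub(fish["Length2"])

print(fish)
```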


### Apply Function
