
# Class 5: A quick and dirty introduction to pandas and Python classes
---
Dr. Daniel Matoz


## Pandas

Pandas is a Python library for data analysis. Its name derives from "panel data" and is also read as "Python data analysis". This tool is essentially your data’s home. Through pandas, you get acquainted with your data by cleaning, transforming, and analyzing it. For example, say you want to explore a dataset stored in a CSV on your computer. Pandas will extract the data from that CSV into a DataFrame (a table, basically) and then let you do things like:

* Calculate statistics and answer questions about the data, like:
  * What's the average, median, max, or min of each column?
  * Does column A correlate with column B?
  * What does the distribution of data in column C look like?
* Clean the data by doing things like removing missing values and filtering rows or columns by some criteria
* Visualize the data with help from Matplotlib. Plot bars, lines, histograms, bubbles, and more.
* Store the cleaned, transformed data back into a CSV, another file, or a database


Before you jump into modeling or complex visualizations, you need a good understanding of the nature of your dataset, and pandas is the best avenue through which to get it.

### Loading and Saving Data with Pandas
When you want to use Pandas for data analysis, you’ll usually use it in one of three different ways:

1.   Convert a Python list, dictionary, or NumPy array to a Pandas DataFrame
2.   Open a local file using Pandas, usually a CSV file, but it could also be a delimited text file (like TSV), an Excel file, etc.
3.   Open a remote file or database, like a CSV or JSON file on a website through a URL, or read from a SQL table/database


To convert a Python object (dictionary, list, etc.) into a DataFrame, the basic command is:

``` python
pd.DataFrame()
```
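
As a minimal sketch of the first route, the snippet below builds the same small table from a dictionary, a list of rows, and a NumPy array. The variable names are illustrative, and the two rows of values are copied from the Iris table shown later in this notebook.

```python
import numpy as np
import pandas as pd

columns = ["sepal length", "sepal width"]

# From a dictionary: keys become column names, values become the columns.
df_from_dict = pd.DataFrame({"sepal length": [5.1, 4.9], "sepal width": [3.5, 3.0]})

# From a list of rows: column names are passed explicitly.
df_from_list = pd.DataFrame([[5.1, 3.5], [4.9, 3.0]], columns=columns)

# From a 2-D NumPy array: same idea as the list of rows.
df_from_array = pd.DataFrame(np.array([[5.1, 3.5], [4.9, 3.0]]), columns=columns)

print(df_from_dict)
print(df_from_dict.equals(df_from_list))  # compares values, dtypes, and labels
```

All three calls should produce the same two-row, two-column DataFrame; which one to use depends only on the shape your data already has.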

For reading files, each file type has its own command, but they all follow the same naming pattern. Here ```filetype``` is a placeholder, as in ```pd.read_csv()```, ```pd.read_excel()```, or ```pd.read_json()```:

``` python
pd.read_filetype()
```
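
A small sketch of reading and saving, assuming CSV as the file type. To keep it self-contained, the "file" is an in-memory string wrapped in `io.StringIO`; in practice you would pass a path or URL instead. The names `csv_text`, `df_csv`, and `round_trip` are illustrative.

```python
import io

import pandas as pd

# A tiny in-memory stand-in for a CSV file on disk.
csv_text = "sepal length,sepal width\n5.1,3.5\n4.9,3.0\n"

# read_csv accepts a path, a URL, or any file-like object.
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv)

# to_csv with no path returns the CSV text as a string;
# index=False leaves the row labels out of the output.
round_trip = df_csv.to_csv(index=False)
print(round_trip)
```

Passing a file name to `to_csv` (for example a hypothetical `"cleaned.csv"`) writes the same text to disk instead of returning it.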


### Viewing the data
Running the name of the data frame would give you the entire table, but you can also get the first *n* rows with 

```
df.head(n)
```

or the last *n* rows with 

```
df.tail(n)
```

As in NumPy, the ```shape``` attribute returns the number of rows and columns:

```
df.shape
```
Another useful command is ```df.info()```, which prints the index, data type, and memory usage information.
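
The viewing commands above can be tried on a small hand-built frame. This is only a sketch: `df_small` is an illustrative name, and its five values are the first five sepal lengths from the Iris table shown later.

```python
import pandas as pd

df_small = pd.DataFrame({
    "sepal length": [5.1, 4.9, 4.7, 4.6, 5.0],
    "class": ["Iris-setosa"] * 5,
})

first_two = df_small.head(2)     # first 2 rows
last_two = df_small.tail(2)      # last 2 rows
n_rows, n_cols = df_small.shape  # shape is an attribute, not a method

print(first_two)
print(last_two)
print(n_rows, n_cols)

# info() prints its summary directly and returns None.
df_small.info()
```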

### Describing the dataset

A very useful command is ```df.describe()```, which outputs summary statistics for numerical columns. It is also possible to get statistics on the entire data frame or on a series (a single column, for example):

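* ```df.mean()``` Returns the mean of each column (in recent pandas versions, pass ```numeric_only=True``` if the data frame also has non-numeric columns)
* ```df.corr()``` Returns the pairwise correlation between columns in a data frame (the same ```numeric_only=True``` caveat applies)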
* ```df.count()``` Returns the number of non-null values in each data frame column
* ```df.max()``` Returns the highest value in each column
* ```df.min()``` Returns the lowest value in each column
* ```df.median()``` Returns the median of each column
* ```df.std()``` Returns the standard deviation of each column
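
A short sketch of these statistics on a frame that mixes numeric and text columns. It assumes a pandas version that supports the `numeric_only` argument on both `mean` and `corr` (1.5 or later); `df_small` and its values are illustrative, taken from the first Iris rows shown later.

```python
import pandas as pd

df_small = pd.DataFrame({
    "sepal length": [5.1, 4.9, 4.7, 4.6, 5.0],
    "sepal width": [3.5, 3.0, 3.2, 3.1, 3.6],
    "class": ["Iris-setosa"] * 5,
})

# numeric_only=True skips the text column instead of failing on it.
col_means = df_small.mean(numeric_only=True)
correlations = df_small.corr(numeric_only=True)

# count() works on every column, numeric or not.
non_null = df_small.count()

print(col_means)
print(correlations)
print(non_null)
```

The same methods also work on a single column, for example `df_small["sepal length"].median()`.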

### Selection

Pandas provides a very easy and fast way of selecting data compared to selecting a value from a list or a dictionary. To select a given column, we simply use:

```
df[col]
```
To select several columns at once, pass a list of column names:
```
df[[col1, col2]]
```
This returns the columns as a new DataFrame. Other ways to select data are by position (```s.iloc[0]```) or by index label (```s.loc['index_one']```).

To select the first row you can use ```df.iloc[0,:]```, and to select the first element of the first column you would run ```df.iloc[0,0]```. Both ```iloc``` and ```loc``` also accept slices and lists, so rows and columns can be selected in different combinations.
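
The selection styles above, sketched on a tiny frame. The row labels `"a"`, `"b"`, `"c"` are hypothetical and were chosen only so that label-based and position-based selection look different.

```python
import pandas as pd

df_small = pd.DataFrame(
    {"sepal length": [5.1, 4.9, 4.7], "sepal width": [3.5, 3.0, 3.2]},
    index=["a", "b", "c"],  # hypothetical row labels
)

one_column = df_small["sepal length"]                    # a Series
two_columns = df_small[["sepal length", "sepal width"]]  # a DataFrame

first_row = df_small.iloc[0, :]   # by position: row 0, all columns
row_b = df_small.loc["b"]         # by label: the row labelled "b"
first_cell = df_small.iloc[0, 0]  # by position: row 0, column 0

print(type(one_column).__name__, type(two_columns).__name__)
print(first_row)
print(row_b)
print(first_cell)
```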

## Example: Reading a CSV from a URL

We will read a CSV file from the [UC Irvine Machine Learning Repository](https://archive-beta.ics.uci.edu/). It is the famous Iris dataset, which contains measurements of iris flowers.

### Dataset information

1. Number of Instances: 150 (50 in each of three classes)

2. Number of Attributes: 4 numeric, predictive attributes and the class

3. Attribute Information:
   1. sepal length in cm
   2. sepal width in cm
   3. petal length in cm
   4. petal width in cm
   5. class: 
      -- Iris Setosa
      -- Iris Versicolour
      -- Iris Virginica


In [5]:
import pandas as pd

# Read the dataset
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'

header_names = ["sepal length", "sepal width", "petal length", "petal width", "class"]

df = pd.read_csv(url, names = header_names)

In [8]:
# Print the shape of the dataset
df.shape

(150, 5)

In [6]:
# Print the first 10 rows (head) of the dataset
df.head(10)

| | sepal length | sepal width | petal length | petal width | class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |


In [7]:
# Print the last 10 rows (tail) of the dataset
df.tail(10)

| | sepal length | sepal width | petal length | petal width | class |
|---|---|---|---|---|---|
| 140 | 6.7 | 3.1 | 5.6 | 2.4 | Iris-virginica |
| 141 | 6.9 | 3.1 | 5.1 | 2.3 | Iris-virginica |
| 142 | 5.8 | 2.7 | 5.1 | 1.9 | Iris-virginica |
| 143 | 6.8 | 3.2 | 5.9 | 2.3 | Iris-virginica |
| 144 | 6.7 | 3.3 | 5.7 | 2.5 | Iris-virginica |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |


In [9]:
# Describe the dataset
df.describe()

| | sepal length | sepal width | petal length | petal width |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.054 | 3.758667 | 1.198667 |
| std | 0.828066 | 0.433594 | 1.76442 | 0.763161 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |


## Filtering the dataset

One of the most important tasks in dealing with data is filtering it by some property. For example, let's say that we want to separate our Iris dataset by class, i.e., create 3 different DataFrames, one for each class. Pandas provides a very easy way to filter data by using a logical (boolean) mask. Let's take a look at the following example:

In [45]:
# Select only the data that correspond to Iris-setosa

mask_Iris_Setosa =  (df['class'] == 'Iris-setosa')
df_Iris_setosa  = df[mask_Iris_Setosa]

print(df_Iris_setosa)


    sepal length  sepal width  petal length  petal width        class
0            5.1          3.5           1.4          0.2  Iris-setosa
1            4.9          3.0           1.4          0.2  Iris-setosa
2            4.7          3.2           1.3          0.2  Iris-setosa
3            4.6          3.1           1.5          0.2  Iris-setosa
4            5.0          3.6           1.4          0.2  Iris-setosa
5            5.4          3.9           1.7          0.4  Iris-setosa
6            4.6          3.4           1.4          0.3  Iris-setosa
7            5.0          3.4           1.5          0.2  Iris-setosa
8            4.4          2.9           1.4          0.2  Iris-setosa
9            4.9          3.1           1.5          0.1  Iris-setosa
10           5.4          3.7           1.5          0.2  Iris-setosa
11           4.8          3.4           1.6          0.2  Iris-setosa
12           4.8          3.0           1.4          0.1  Iris-setosa
13           4.3    

In [30]:
# we can do the same for the other two, for example: 

mask_Iris_Versicolour =  (df['class'] == 'Iris-versicolor')
df_Iris_Versicolour  = df[mask_Iris_Versicolour]
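
To make the masking step more concrete, here is a minimal sketch on a four-row frame. It shows that the mask is just a boolean Series, and, as one possible shortcut, builds one DataFrame per class with a dictionary comprehension. The names `df_small` and `frames_by_class` are illustrative and not part of the notebook above.

```python
import pandas as pd

df_small = pd.DataFrame({
    "petal length": [1.4, 1.3, 4.2, 4.3],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-versicolor"],
})

# The comparison yields one True/False value per row.
mask = df_small["class"] == "Iris-versicolor"
print(mask.tolist())

# Indexing with the mask keeps only the rows where it is True.
df_versicolor = df_small[mask]
print(df_versicolor)

# One filtered DataFrame per class, keyed by the class name.
frames_by_class = {
    name: df_small[df_small["class"] == name]
    for name in df_small["class"].unique()
}
print(list(frames_by_class))
```

Note that the filtered frames keep the row labels of the original frame, which is why the versicolor rows above are still labelled 2 and 3.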


In [18]:
# Now we can describe the dataset

df_Iris_setosa.describe()

| | sepal length | sepal width | petal length | petal width |
|---|---|---|---|---|
| count | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 5.006 | 3.418 | 1.464 | 0.244 |
| std | 0.35249 | 0.381024 | 0.173511 | 0.10721 |
| min | 4.3 | 2.3 | 1.0 | 0.1 |
| 25% | 4.8 | 3.125 | 1.4 | 0.2 |
| 50% | 5.0 | 3.4 | 1.5 | 0.2 |
| 75% | 5.2 | 3.675 | 1.575 | 0.3 |
| max | 5.8 | 4.4 | 1.9 | 0.6 |


In [31]:
df_Iris_Versicolour.describe()

| | sepal length | sepal width | petal length | petal width |
|---|---|---|---|---|
| count | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 5.936 | 2.77 | 4.26 | 1.326 |
| std | 0.516171 | 0.313798 | 0.469911 | 0.197753 |
| min | 4.9 | 2.0 | 3.0 | 1.0 |
| 25% | 5.6 | 2.525 | 4.0 | 1.2 |
| 50% | 5.9 | 2.8 | 4.35 | 1.3 |
| 75% | 6.3 | 3.0 | 4.6 | 1.5 |
| max | 7.0 | 3.4 | 5.1 | 1.8 |


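In [25]:
# Select Iris-virginica the same way and show its last five rows
df_Iris_Virginica = df[df['class'] == 'Iris-virginica']
df_Iris_Virginica.tail()

| | sepal length | sepal width | petal length | petal width | class |
|---|---|---|---|---|---|
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |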


## Joining datasets

Pandas also provides a very efficient way to join datasets. For example, ```pd.concat``` stacks DataFrames on top of each other:

In [44]:
# Note: despite the variable name, this stacks the setosa and versicolor subsets
df_Iris_Setosa_Virginica = pd.concat([df_Iris_setosa, df_Iris_Versicolour])

print(df_Iris_Setosa_Virginica)

    sepal length  sepal width  petal length  petal width            class
0            5.1          3.5           1.4          0.2      Iris-setosa
1            4.9          3.0           1.4          0.2      Iris-setosa
2            4.7          3.2           1.3          0.2      Iris-setosa
3            4.6          3.1           1.5          0.2      Iris-setosa
4            5.0          3.6           1.4          0.2      Iris-setosa
..           ...          ...           ...          ...              ...
95           5.7          3.0           4.2          1.2  Iris-versicolor
96           5.7          2.9           4.2          1.3  Iris-versicolor
97           6.2          2.9           4.3          1.3  Iris-versicolor
98           5.1          2.5           3.0          1.1  Iris-versicolor
99           5.7          2.8           4.1          1.3  Iris-versicolor

[100 rows x 5 columns]
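
One detail worth seeing on a small case is what happens to the row labels when frames are stacked. The sketch below uses two illustrative two-row frames (`top` and `bottom`, with deliberately non-contiguous labels) and compares the default behaviour with `ignore_index=True`.

```python
import pandas as pd

top = pd.DataFrame({"sepal length": [5.1, 4.9]}, index=[0, 1])
bottom = pd.DataFrame({"sepal length": [5.7, 6.2]}, index=[96, 97])

# By default, concat keeps each frame's original row labels.
stacked = pd.concat([top, bottom])
print(stacked)

# ignore_index=True discards them and renumbers from 0.
renumbered = pd.concat([top, bottom], ignore_index=True)
print(renumbered)
```

Keeping the original labels is convenient for tracing rows back to the source table; renumbering is convenient when the combined frame is treated as a new dataset.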


## Creating new columns

Pandas DataFrames work like a Python dictionary: assigning to a new key creates a new column. So let's say that you want to create a new feature called ```sepal length_width```, which is the sum ```sepal width + sepal length```:

In [46]:
df_Iris_Setosa_Virginica['sepal length_width'] = df_Iris_Setosa_Virginica['sepal length'] + df_Iris_Setosa_Virginica['sepal width']

df_Iris_Setosa_Virginica.head()

| | sepal length | sepal width | petal length | petal width | class | sepal length_width |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa | 8.6 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa | 7.9 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa | 7.9 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa | 7.7 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa | 8.6 |
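
The same idea on a two-row frame, as a sketch. Besides dictionary-style assignment, it shows `DataFrame.assign`, which returns a new frame and leaves the original untouched. The `ratio` column is a hypothetical feature used only for illustration.

```python
import pandas as pd

df_small = pd.DataFrame({"sepal length": [5.1, 4.9], "sepal width": [3.5, 3.0]})

# Dictionary-style assignment adds the column to df_small in place.
df_small["sepal length_width"] = df_small["sepal length"] + df_small["sepal width"]
print(df_small)

# assign() returns a copy with the extra column; df_small is unchanged.
df_ratio = df_small.assign(ratio=df_small["sepal length"] / df_small["sepal width"])
print(df_ratio.columns.tolist())
print(df_small.columns.tolist())
```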


## Convert columns to NumPy arrays

Pandas columns can be converted to NumPy arrays by doing:

In [47]:
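# .values returns the column's underlying NumPy array
lw = df_Iris_Setosa_Virginica['sepal length_width'].values
print(type(lw))

# or, equivalently, with the to_numpy() method (recommended in current pandas)
lw = df_Iris_Setosa_Virginica['sepal length_width'].to_numpy()
print(lw)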

<class 'numpy.ndarray'>
[ 8.6  7.9  7.9  7.7  8.6  9.3  8.   8.4  7.3  8.   9.1  8.2  7.8  7.3
  9.8 10.1  9.3  8.6  9.5  8.9  8.8  8.8  8.2  8.4  8.2  8.   8.4  8.7
  8.6  7.9  7.9  8.8  9.3  9.7  8.   8.2  9.   8.   7.4  8.5  8.5  6.8
  7.6  8.5  8.9  7.8  8.9  7.8  9.   8.3 10.2  9.6 10.   7.8  9.3  8.5
  9.6  7.3  9.5  7.9  7.   8.9  8.2  9.   8.5  9.8  8.6  8.5  8.4  8.1
  9.1  8.9  8.8  8.9  9.3  9.6  9.6  9.7  8.9  8.3  7.9  7.9  8.5  8.7
  8.4  9.4  9.8  8.6  8.6  8.   8.1  9.1  8.4  7.3  8.3  8.7  8.6  9.1
  7.6  8.5]
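
As a small sketch of the shapes involved: converting a single column gives a 1-D array, while converting a whole DataFrame of numeric columns gives a 2-D array with one row per record. The frame and variable names here are illustrative.

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame({"sepal length": [5.1, 4.9], "sepal width": [3.5, 3.0]})

column_array = df_small["sepal length"].to_numpy()  # 1-D: one value per row
table_array = df_small.to_numpy()                   # 2-D: rows x columns

print(type(column_array).__name__, column_array.shape)
print(table_array.shape)
```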


## Python classes

Classes are used to create user-defined data structures. Classes define functions called methods, which identify the behaviors and actions that an object created from the class can perform with its data. Defining your first class:

In [48]:
class Animal:
  pass

The body of the Animal class consists of a single statement: the ```pass``` keyword. ```pass``` is often used as a placeholder indicating where code will eventually go. It allows you to run this code without Python throwing an error.

The Animal class isn’t very interesting yet, so let's define some properties for each animal. The properties that all Animal objects must have are defined in a method called ```.__init__()```. Every time a new Animal object is created, ```.__init__()``` sets the initial state of the object by assigning the values of the object’s properties. That is, ```.__init__()``` initializes each new instance of the class.

You can give ```.__init__()``` any number of parameters, but the first parameter is always the instance itself, conventionally named ```self```. When a new class instance is created, the instance is automatically passed to the ```self``` parameter in ```.__init__()``` so that new attributes can be defined on the object.

In [49]:
class Animal:
  def __init__(self, name, num_legs, num_eyes):
    self.name = name
    self.num_legs = num_legs
    self.num_eyes = num_eyes


Attributes created in ```.__init__()``` are called **instance attributes**. 

In [50]:
spider = Animal("spider", 8, 8)
dog = Animal("dog", 4, 2)
human = Animal("human", 2, 2)
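
Once the instances exist, their attributes can be read and changed with dot notation. The sketch below repeats the class definition so that it runs on its own; the reassignment at the end is only an illustration that each instance holds its own values.

```python
class Animal:
    def __init__(self, name, num_legs, num_eyes):
        # Each assignment to self creates an instance attribute.
        self.name = name
        self.num_legs = num_legs
        self.num_eyes = num_eyes


spider = Animal("spider", 8, 8)
dog = Animal("dog", 4, 2)

# Attributes are read with dot notation.
print(spider.name, spider.num_legs)
print(dog.name, dog.num_legs)

# Changing one instance does not affect the other.
dog.num_legs = 3
print(dog.num_legs, spider.num_legs)
```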

Instance methods are functions that are defined inside a class and can only be called from an instance of that class. Just like ```.__init__()```, an instance method’s first parameter is always ```self```.

In [62]:
class Animal:
  def __init__(self, name, num_legs, num_eyes):
    self.name = name
    self.num_legs = num_legs
    self.num_eyes = num_eyes

  # Instance method
  def description(self):
    return f"{self.name} has {self.num_legs} legs and {self.num_eyes} eyes"
  
  # Calling one instance method from another via self
  def description2(self):
    animal_description = self.description()
    return animal_description.split()


In [63]:
spider = Animal("spider", 8, 8)
dog = Animal("dog", 4, 2)
human = Animal("human", 2, 2)

In [61]:
print(spider.description(), "but a", human.description())

spider has 8 legs and 8 eyes but a human has 2 legs and 2 eyes


You can also call an instance method from within the class, as ```description2()``` does in the example above:

In [65]:
print(dog.description2())

['dog', 'has', '4', 'legs', 'and', '2', 'eyes']
