## Section I: Pandas 

**Pandas** is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.

Pandas is well suited for many different kinds of data such as:

- Tabular data.
- Time series data.
- Arbitrary matrix data with row and column labels.



In [None]:
# Import the Pandas library as "pd" and NumPy as "np"
import pandas as pd
import numpy as np

### I.1. Introduction to Pandas's data structures

Pandas is built around two core data structures: 

1 - Series: a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).

2 - DataFrame: a 2-dimensional labeled data structure with columns of potentially different types.

Older releases also offered a third structure, Panel (a 3D container of data). It was deprecated in pandas 0.20 and has since been removed; a DataFrame with a `MultiIndex` is the usual replacement.
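
As a quick illustration of the difference in dimensionality, the sketch below builds one Series and one DataFrame and inspects their shape. The variable names and values are made up for this example.

```python
import pandas as pd

# A Series is one-dimensional: one column of values plus an index.
s = pd.Series([1.5, 2.5, 3.5])

# A DataFrame is two-dimensional: several columns that share one index.
frame = pd.DataFrame({"x": [1, 2, 3], "label": ["a", "b", "c"]})

print(s.ndim, s.shape)          # 1 (3,)
print(frame.ndim, frame.shape)  # 2 (3, 2)
print(frame.dtypes)             # each column keeps its own dtype
```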

Let's focus on *Series*. There are multiple ways to create a series. The simplest way is creating from a NumPy array. 

First, create a NumPy array `arr` with the elements $[10,20,30,40,50]$, then construct a Series from `arr` with the constructor `series = pd.Series(arr)` and `print` the result.

In [None]:
arr = np.array([10, 20, 30, 40, 50])
print(arr)
series = pd.Series(arr)
print(series)  # the index defaults to 0..4


[10 20 30 40 50]
0    10
1    20
2    30
3    40
4    50
dtype: int64


Now, let's construct a Series from a dictionary. A dictionary is a data structure that holds key-value pairs. For example, in Python, a dictionary can be defined by:
`my_dictionary = {'a':10, 'b':20, 'c':30}`

In [None]:
#THIS CODE IS PROVIDED AS AN EXAMPLE
# Creating a series from a Python dict
# Note that the keys of the dictionary are used to assign indexes during conversion
data = {'a':10, 'b':20, 'c':30}
series2 = pd.Series(data)
print(series2)

a    10
b    20
c    30
dtype: int64
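
Because the dictionary keys become the index, values can then be looked up by label as well as by position. A minimal sketch, reusing the same dictionary as above:

```python
import pandas as pd

series2 = pd.Series({"a": 10, "b": 20, "c": 30})

print(series2.index.tolist())  # ['a', 'b', 'c']
print(series2["b"])            # 20, looked up by label
print(series2.iloc[0])         # 10, looked up by position
print("c" in series2)          # True, membership is tested against the index
```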


In the table below, treat each student's name as the key and their weight as the value. Construct a dictionary to represent the table, then create a Series from that dictionary.

| Student| Weight|
| --- | ---|
| Alice | 60 |
| Bob   | 65 |
| Carol | 45 |

In [None]:
data = {'Alice': 60,
        'Bob': 65,
        'Carol': 45}
series = pd.Series(data)
print(series)

Alice    60
Bob      65
Carol    45
dtype: int64


To retrieve part of a Series, you can use slicing, just as with a NumPy array. Let's print the last two elements of the Series you created in the previous question.

In [None]:
print(series[-2:])



Bob      65
Carol    45
dtype: int64
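
Slices can also be written explicitly by position with `.iloc` or by label with `.loc`. The sketch below rebuilds the same weights Series to show one difference worth remembering: a positional slice excludes its end point, while a label slice includes it.

```python
import pandas as pd

weights = pd.Series({"Alice": 60, "Bob": 65, "Carol": 45})

last_two = weights.iloc[-2:]            # positional: end is exclusive
by_label = weights.loc["Alice":"Bob"]   # label-based: both ends are included

print(last_two)
print(by_label)
```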


### I.2. DataFrames


- A DataFrame is a 2D data structure in which data is arranged in a table of rows and columns.
- A DataFrame can be created using the constructor `pandas.DataFrame(data, index, columns, dtype, copy)`, whose parameters are described below.
- `data`: the input, which can take several forms such as an ndarray, list, constants, Series, or dict.
- `index`: the row labels (`columns` does the same for the column labels); defaults to `np.arange(n)` if no index is passed.
- `dtype`: the data type to force on the data; if omitted, each column's type is inferred.
- `copy`: whether to copy the input data; in recent pandas versions the default depends on the type of `data`.
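
The constructor parameters can be sketched as follows. The array contents and the labels `r1`..`r3`, `left`, and `right` are made up for illustration.

```python
import numpy as np
import pandas as pd

values = np.array([[1, 2], [3, 4], [5, 6]])

table = pd.DataFrame(
    values,
    index=["r1", "r2", "r3"],    # row labels
    columns=["left", "right"],   # column labels
    dtype=float,                 # cast every column to float64
)
print(table)
print(table.dtypes)
```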



In [None]:
#THIS CODE IS PROVIDED AS AN EXAMPLE
# Converting a list into a DataFrame
list1 = [11, 24, 32, 58]
table = pd.DataFrame(list1)
print(table)
# Creating a DataFrame from a list of dictionaries
data = [{'a':1, 'b':2}, {'a':2, 'b':4, 'c':8}]
table1 = pd.DataFrame(data)
print(table1)

# Note that NaN (not a number) is stored wherever no data is provided
# Creating a DataFrame from a list of dictionaries and accompanying row indices
table2 = pd.DataFrame(data, index = ['first', 'second'])
# Dict keys become column labels
print(table2)
# Converting a dictionary of series into a DataFrame
data1 = {'one':pd.Series([1,2,3], index = ['a', 'b', 'c']),
        'two':pd.Series([1,2,3,4], index = ['a', 'b', 'c', 'd'])}
table3 = pd.DataFrame(data1)
print(table3)
# the resultant index is the union of all the series indexes passed

    0
0  11
1  24
2  32
3  58
   a  b    c
0  1  2  NaN
1  2  4  8.0
        a  b    c
first   1  2  NaN
second  2  4  8.0
   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4


Construct a dataframe, using the data from the following table: 

| Student| Course 1| Course 2|
| --- | ---| --- |
| Alice | 8.5 | 7.5 |
| Bob   | 6.0 | 6.0 |
| Carol | 7.5 | 6.5 |
| Dan   | 9.0 | 8.0 |

In [None]:
# Plain lists are enough here; each dict key becomes a column
data1 = {'Student': ['Alice', 'Bob', 'Carol', 'Dan'],
         'Course 1': [8.5, 6.0, 7.5, 9.0],
         'Course 2': [7.5, 6.0, 6.5, 8.0]}
table3 = pd.DataFrame(data1)
print(table3)

  Student  Course 1  Course 2
0   Alice       8.5       7.5
1     Bob       6.0       6.0
2   Carol       7.5       6.5
3     Dan       9.0       8.0
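
Once the table is a DataFrame, rows and columns can be selected by label. The following sketch is one possible follow-up, not part of the exercise: it uses the student names as the row index and computes each student's average grade.

```python
import pandas as pd

grades = pd.DataFrame({
    "Student": ["Alice", "Bob", "Carol", "Dan"],
    "Course 1": [8.5, 6.0, 7.5, 9.0],
    "Course 2": [7.5, 6.0, 6.5, 8.0],
})

# Use the names as row labels instead of 0..3
by_student = grades.set_index("Student")

print(by_student.loc["Carol"])        # one row, selected by label
print(by_student["Course 1"].max())   # highest grade in Course 1
averages = by_student.mean(axis=1)    # average across the two courses, per student
print(averages)
```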


### I.3. Importing & Exporting Data

Pandas is commonly used to read and write CSV files. Data stored in the CSV format can be loaded into a **DataFrame** using the `read_csv()` function. We have prepared a sample CSV file called `"height-weight.csv"` for you at this [link](https://drive.google.com/file/d/1xZfUv4sJofcJH4zedxt0eoGpVgQHEslu/view?usp=sharing). Download this file to your local machine.

In [None]:
# Now upload the downloaded file to the Colab session
from google.colab import files
up = files.upload()

Complete the following tasks:

*   Use `pd.read_csv("file_name.csv")` to read all the data into a dataframe
*   Print the dataframe
*   Print the first 5 records of the dataframe
*   Print the "Height" column of the dataframe
*   Print all records with a height greater than 1.5

In [None]:
df = pd.read_csv("height-weight.csv")
print(df)
print(df.head(5))        # first 5 records
height = df["Height"]
print(height)
print(df[height > 1.5])  # boolean mask keeps the rows where the condition holds
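
The same steps can be tried without the file. In the sketch below the CSV text is made up for illustration and is not taken from `height-weight.csv`; only the column names `Height` and `Weight` follow the exercise.

```python
import io
import pandas as pd

# Made-up rows for illustration only
csv_text = "Height,Weight\n1.47,52.2\n1.55,55.8\n1.63,59.9\n1.75,66.3\n"

toy = pd.read_csv(io.StringIO(csv_text))  # read_csv also accepts file-like objects

mask = toy["Height"] > 1.5   # a Series of True/False values, one per row
tall = toy[mask]             # keep only the rows where the mask is True

print(toy.head(2))
print(tall)
```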




The Body Mass Index (BMI) is a simple calculation based on a person's height and weight. The formula is:
$$BMI = \frac{weight}{height^2} $$ where $weight$ is a person's weight in kg and $height$ is their height in metres. 
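
As a worked example of the formula, take a hypothetical person who weighs 60 kg and is 1.65 m tall (values chosen only for illustration):

```python
weight_kg = 60.0   # hypothetical person
height_m = 1.65

bmi = weight_kg / height_m ** 2   # 60 / 2.7225
print(round(bmi, 2))              # 22.04
```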

---

* Calculate the BMI for each record in the given dataframe, and add this `BMI` column to the right of the `Height` column. Please refer to this [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mul.html#pandas.DataFrame.mul) to learn more about basic arithmetic operators in Pandas.
* Use `DataFrame.to_csv("file_name.csv")` to export this dataframe to a new CSV file called `height-weight-bmi.csv`

In [None]:
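# Element-wise arithmetic: one BMI value per row
BMI = df["Weight"] / (df["Height"]**2)
df['BMI'] = BMI  # the new column is appended as the last (right-most) column
print(df)

# Export; the row index is written as an extra first column unless index=False is passed
df.to_csv('height-weight-bmi.csv')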
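
The effect of the `index` argument of `to_csv` can be seen in a small sketch. The two rows are made up, and calling `to_csv()` without a path returns the CSV text instead of writing a file.

```python
import pandas as pd

toy = pd.DataFrame({"Height": [1.5, 1.8], "Weight": [50.0, 81.0]})

text_default = toy.to_csv()              # row index written as an unnamed first column
text_no_index = toy.to_csv(index=False)  # only the data columns

print(text_default)
print(text_no_index)
```

If the exported file is meant to be read back with `pd.read_csv`, passing `index=False` avoids ending up with an extra `Unnamed: 0` column.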









## Section II: Matplotlib

What is **"Matplotlib"**?
- Matplotlib is a Python library designed for creating graphs, charts, and other plots, including interactive data visualisations
- Matplotlib is inspired by the MATLAB software and reproduces many of its features

We will provide some examples for you in the following cells. This is a powerful tool, and it takes time for you to get familiar with it. For full documentation, please visit this [link](https://matplotlib.org/).

In [None]:
# Import Matplotlib submodule for plotting
import matplotlib.pyplot as plt

In [None]:
plt.plot([1,2,3,4]) # List of vertical co-ordinates of the points plotted
plt.show() # Displays plot
# Implicit X-axis values from 0 to (N-1) where N is the length of the list

In [None]:
x = np.linspace(start = -2, stop = 2,num = 50)
plt.plot(x, x, label = 'linear')
plt.plot(x, x*x, label = 'square')
plt.plot(x, x*x*x, label = 'cube')
plt.grid(True)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title("Polynomial Graph")
plt.legend()
plt.show()

In [None]:
x = [1, 2, 3, 4, 5, 6]
y = [11, 22, 33, 44, 55, 66]
plt.plot(x, y, 'bo') # 'bo' draws blue circle markers without a connecting line
# Annotate every point with its coordinates
for x_coord, y_coord in zip(x, y):
    plt.text(x_coord, y_coord, f"({x_coord}, {y_coord})", fontsize = 10)
plt.show()

In [None]:
# Histograms show how the values of a variable are distributed across bins
y = np.random.randn(10000) # 10,000 samples from a standard Gaussian distribution
plt.hist(y) # Takes the dataset as the parameter; a 2D array would be treated as one dataset per column
plt.show()

In [None]:
# Bar charts are used to visually compare two or more values using rectangular bars
# Default width of each bar is 0.8 units
# [1,2,3] Mid-point of the lower face of every bar
# [1,4,9] Heights of the successive bars in the plot
plt.bar([1,2,3], [1,4,9])
plt.show()

Using the height-weight dataframe loaded from the `height-weight.csv` file:

---

* Plot height against weight as a scatter plot (the x-axis is the height and the y-axis is the weight)
* Plot the following lines on the same plot:
$$
\begin{aligned}
(d_1)&: y = 61x - 39 \\
(d_2)&: y = 53x - 25
\end{aligned}
$$
(the final plot should show the height-weight points and both lines). 
Hint: use `np.linspace` with `start` and `stop` equal to the minimum height and the maximum height, respectively
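
To see what the hint produces, here is a minimal sketch with made-up heights (not from the CSV). `np.linspace` returns evenly spaced x-values that include both end points, and each line is then evaluated at those values.

```python
import numpy as np

heights = np.array([1.50, 1.62, 1.71, 1.80])   # made-up values for illustration

x = np.linspace(start=heights.min(), stop=heights.max(), num=4)
d1 = 61 * x - 39
d2 = 53 * x - 25

print(x)    # 4 evenly spaced points from 1.5 to 1.8
print(d1)
print(d2)
```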

In [None]:
height = df["Height"].to_numpy()
weight = df["Weight"].to_numpy()
plt.scatter(height, weight, color = "orange", label = "data")
plt.xlabel("Height")
plt.ylabel("Weight")

# x-values spanning the observed height range (50 points by default)
x = np.linspace(start = height.min(), stop = height.max())
plt.plot(x, 61*x - 39, label = "y = 61 * x - 39", color = 'red')
plt.plot(x, 53*x - 25, label = "y = 53 * x - 25", color = 'green')
plt.legend(loc = "lower right") # one call, after all lines are drawn
plt.show()


## Section III: Linear regression in Python (Part 1)

You have learned about linear regression in lectures. Linear regression searches for relationships among variables. For example:
- How do house prices depend on their area, number of bedrooms, and distance to the city center?
- How do salaries depend on features such as experience, level of education, role, and the city of work?

Before going into details, look at the lines $d_1$ and $d_2$ in the previous section. 
* What do you think about these two lines? 
* For the linear regression task, which line is better?
* Mathematically, given two lines, how can you quantify whether one is better than the other?  
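
One common way to answer the last question is the mean squared error (MSE): the average squared vertical distance between the line and the data points. This criterion is an assumption here, since the lectures may use a different one. The sketch below uses made-up points and a hypothetical helper `mean_squared_error`, so its outcome says nothing about the real dataset.

```python
import numpy as np

# Made-up (height, weight) pairs for illustration only
heights = np.array([1.50, 1.60, 1.70, 1.80])
weights = np.array([53.0, 59.0, 65.0, 71.0])

def mean_squared_error(x, y, slope, intercept):
    """Average squared vertical distance between the line and the points."""
    predicted = slope * x + intercept
    return np.mean((y - predicted) ** 2)

mse_d1 = mean_squared_error(heights, weights, 61, -39)
mse_d2 = mean_squared_error(heights, weights, 53, -25)
print(mse_d1, mse_d2)   # the smaller value marks the better fit on these points
```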

`Scikit-learn` is a machine learning library for Python; its `linear_model` module provides the linear regression model we are focusing on. Fitting a linear regression takes just two lines of code. The implementation for finding the best line is provided below.

In [None]:
# Import the linear models from scikit-learn, and Matplotlib for plotting
import sklearn.linear_model as LM
import matplotlib.pyplot as plt
df     = pd.read_csv("height-weight.csv")
# scikit-learn expects a 2D input of shape (n_samples, n_features)
height = df.Height.values[:,np.newaxis]
weight = df.Weight.values
# Initialize a LINEAR REGRESSION model
lr=LM.LinearRegression() 
# Use model.fit(X (input values), y (target values)) to fit the model
lr.fit(height,weight)
print("We can find the coefficient, and the bias of the model: ")
print("Coefficient        : ",lr.coef_)
print("Intercept (or bias):", lr.intercept_)
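
As a sanity check on what `fit` computes, the sketch below fits made-up points with `LinearRegression` and with NumPy's degree-1 least-squares fit, `np.polyfit`. Both minimise the squared error, so the slope and intercept should agree up to floating-point precision.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up points for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])

model = LinearRegression().fit(x.reshape(-1, 1), y)  # X must be 2D
slope, intercept = np.polyfit(x, y, deg=1)           # degree-1 least squares

print(model.coef_[0], model.intercept_)
print(slope, intercept)
```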

In the 2D case, we can visualize the "best line" as you did in the plotting exercise. Given a new height value, the model can now predict the corresponding weight. Try the cell below.

In [None]:
knownvalue = float(input("Enter the value of height:"))
# predict() expects a 2D input and returns an array; take its single element
findvalue = lr.predict([[knownvalue]])[0]
print("The height value is",knownvalue,"\nThe predicted weight value is",round(findvalue, 2))

It's your turn now. We will do some experiments on the *QSAR fish toxicity Data Set*. 
Download this dataset from this link: [QSAR fish toxicity Data Set](https://archive.ics.uci.edu/ml/datasets/QSAR+fish+toxicity).


Upload this dataset to Google Colab (you can skip this step if you are doing this assignment on your own machine).

We only focus on the following task: **Given the value of `MLOGP`, what is the response value `LC50`?**
Note that this `csv` file does not contain column headers, so you need to look at the "Data Set description" section for the column order. 

---
- Use `sklearn` to do linear regression. What are the coefficient and the bias of your model?
- Use `matplotlib` to plot `MLOGP` vs `LC50`, together with the best line that you found, on the same plot.



In [None]:
# The file has no header row; header=None stops the first record being used as column names
df = pd.read_csv("qsar_fish_toxicity (1).csv", delimiter = ";", header = None)
print(df)
plt.figure(figsize=(12, 8))
# Column 5 is MLOGP and column 6 is LC50 (see the data set description)
x = df.iloc[:,5].values
y = df.iloc[:,6].values
plt.scatter(x, y)
# Initialize a LINEAR REGRESSION model
lr=LM.LinearRegression() 
# Use model.fit(X (input values), y (target values)) to fit the model
x = x.reshape(-1, 1)
y = y.reshape(-1, 1)
lr.fit(x,y)

print("We can find the coefficient, and the bias of the model: ")
print("Coefficient        : ",lr.coef_)
a=lr.coef_[0][0]
print(a)

print("Intercept (or bias):", lr.intercept_)
b=lr.intercept_[0]
print(b)
plt.plot(x, a*x+b, label = f"y = {a:.2f}x + {b:.2f}",color='red')
plt.xlabel("MLOGP")
plt.ylabel("LC50")
plt.legend (loc="lower right")
plt.show()

In [None]:
from google.colab import drive
drive.mount('/content/drive')