# PUC COURSE - EEIGM 
# BLOCK 1: DATA ANALYSIS 

Raul Benitez, Universitat Politècnica de Catalunya, raul.benitez@upc.edu

**SCHEDULE AND CONTENTS**

0. Introduction to Python programming (1h)
1. Exploratory data analysis in Python (1h)
2. Visualisation of multidimensional data (1h)
3. Data clustering (2h)
4. Supervised classification algorithms (3h)



## 0. Introduction to Python programming

## Why Python?
Python is a high-level, general-purpose programming language. It is especially useful in data processing, pattern recognition and artificial intelligence applications. Think of it as a very powerful programmable scientific calculator.

The objective of this first part of the course is to give you a crash course on Python programming. In the following laboratory sessions you will learn how to use Python for data analysis and machine learning.

## The Google Colaboratory platform: Python notebooks

There are many programming environments in which you can develop Python programs. The most common ones require you to install a complete Python distribution on your computer, such as Anaconda (https://www.anaconda.com/, for Windows, macOS and Linux) or WinPython (https://winpython.github.io/, Windows only). These distributions work perfectly well but require a lot of resources from your personal computer in terms of disk space, memory and CPU.

### Google Colab
Instead, in this course we will use a Google cloud platform called Google Colaboratory (Colab for short), which already includes most of the necessary libraries and lets us work without installing anything on our computers:

https://colab.research.google.com/

All it requires to start working is a Google account (for example, a Gmail address). In addition, it runs our programs on Google servers, which can be very useful when we manipulate large amounts of data.

### Python notebooks
On this platform we program in **Python notebooks**. A notebook is a document that combines text and Python code, and it is composed of two kinds of cells: code cells and text cells. Text cells explain what we are doing at each step, and code cells contain the Python code segments that are executed to perform the different tasks. Here is a brief tutorial on how to get started with Jupyter notebooks in Google Colab:

https://colab.research.google.com/notebooks/intro.ipynb

The text in the notebooks is written in a simple markup language called **Markdown**. For a brief introduction to Markdown, take a look at https://colab.research.google.com/notebooks/markdown_guide.ipynb

## Executing commands

To execute a command, write it in a **code cell** and run it either by clicking on the play icon on the left of the cell or with the keyboard shortcut Shift+Return.

In [29]:
3 + 4 + 9

16

In [30]:
list(range(10))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

## Modules and libraries

Python is a general-purpose programming language, so when we want to use more specific commands (such as statistical operators or string processing operators) we usually need to import them before we can use them. For scientific Python, one of the most important libraries is **NumPy** (Numerical Python: basic mathematical operations with vectors and arrays, https://numpy.org/), which can be loaded like this:

In [31]:
import numpy as np
np.sqrt(25)

5.0

In [32]:
np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
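As a small illustration of why NumPy arrays are convenient, the sketch below applies arithmetic to a whole array at once instead of looping over its elements. The variable names (`x`, `squares`, `total`) are only illustrative and do not appear elsewhere in the course.

```python
import numpy as np

x = np.arange(5)          # array with the integers 0, 1, 2, 3, 4
squares = x ** 2          # element-wise power, no explicit loop needed
total = squares.sum()     # sum of all the elements of the array

print(squares)
print(total)
```

The same operations on a plain Python list would need a loop or a list comprehension.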

Access to the functions, variables and classes of a module depends on the way the module was imported:

In [33]:
import math
math.cos(math.pi)

-1.0

In [34]:
import math as m  # import using an alias
m.cos(m.pi)

-1.0

In [35]:
from math import cos,pi # import only some functions
cos(pi)

-1.0

In [36]:
from math import *   # global import
cos(pi)

-1.0
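The import styles above differ only in the names they make available; the function that runs is the same in every case. A minimal sketch that checks this, using only the standard `math` module (the variable `same_function` is just an illustrative name):

```python
import math
import math as m
from math import cos, pi

# The three names refer to the very same function object.
same_function = (math.cos is m.cos) and (m.cos is cos)
print(same_function)

# The result is identical whichever name is used to call it.
print(cos(pi) == math.cos(math.pi))
```

Importing with an alias or importing selected names keeps it clear where each function comes from, which is one reason these styles are usually preferred over a global import.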

Besides NumPy, the most important libraries in the field of data analysis are the following:

**Scipy**: Basic Statistical Operations: https://www.scipy.org/

**Pandas**: Allows you to manipulate data files in excel, csv, text, etc. https://pandas.pydata.org/

**Seaborn**: Graphing data in pandas format: https://seaborn.pydata.org/

**Matplotlib**: Plotting NumPy arrays: https://matplotlib.org/

**scikit-learn**: Pattern recognition and artificial intelligence tools: https://scikit-learn.org/stable/

**scikit-image**: Image processing: https://scikit-image.org/

## Variables
Often the value returned by an operation will be used later on. Values can be stored for later use with the **assignment operator**:

In [37]:
a = 101
type(a)

int

The command has stored the value 101 under the name <code>a</code>. Such stored values are called **objects**. 

Making an assignment to an object defines the object. Once an object has been defined, it can be referred to and used in later computations. 

To refer to the value stored in the object, just use the object’s name itself. For instance:

In [38]:
np.sqrt(a)

10.04987562112089

In [39]:
a = np.sqrt(a)
a
type(a)

numpy.float64

There are some general rules for object names:

+ Use only letters and numbers and ‘underscores’ (_)
+ Do NOT use spaces anywhere in the name
+ A number cannot be the first character in the name
+ Capital letters are treated as distinct from lower-case letters (i.e., Python is case-sensitive)
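These rules can be checked without triggering an error by using the string method `str.isidentifier()`, which tests whether a piece of text is a syntactically valid name. The list of candidate names below is made up for illustration.

```python
candidates = ["a", "a_3", "my value", "3a", "Total", "total"]

for name in candidates:
    # isidentifier() returns True when the text follows the naming rules
    print(repr(name), name.isidentifier())

valid = [name for name in candidates if name.isidentifier()]

# Python is case-sensitive: these are two different objects.
Total = 1
total = 2
print(Total == total)
```

Note that `isidentifier()` only checks the syntax of the name; reserved words such as `for` pass the check but still cannot be used as object names.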

In [40]:
3a = 10

SyntaxError: ignored

## Printing out variables:

You can check the value of a variable by printing it. You can also embed the value of the variable in a string using the string's *format* method inside the print instruction:

In [None]:
a = 19
print('The value of a is {}'.format(a))

In [None]:
print(a)
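An equivalent and often shorter way to embed values is the formatted string literal (f-string), available from Python 3.6 onwards. The sketch below compares both forms; the variable names `msg_format`, `msg_fstring` and `msg_rounded` are only illustrative.

```python
a = 19

msg_format = 'The value of a is {}'.format(a)   # str.format, as above
msg_fstring = f'The value of a is {a}'          # f-string, same result

# A format specification after the colon controls the number of decimals.
msg_rounded = f'The square root of a is {a ** 0.5:.3f}'

print(msg_fstring)
print(msg_rounded)
```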

## A simple program in Python

General Rules:

+ All text from a <code>#</code> symbol to the end of a line is considered a comment.
+ Code blocks must be **indented**, and the statement that opens a block ends with a colon. The Python standard for indentation is four spaces. Never use tabs: mixing tabs and spaces can produce hard-to-find errors. Set your editor to convert tabs to spaces.
+ Typically, a statement must be on a line. You can use a backslash <code>\</code> at the end of a line to continue a statement on to the next line.
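The three rules above can be seen together in a short sketch. The list `values` and the counter `count_large` are made-up names used only for this illustration.

```python
# Everything after the # symbol is a comment.
values = [3, 8, 12, 5]

count_large = 0
for v in values:            # the colon opens an indented block
    if v > 4:               # a nested block is indented four more spaces
        count_large += 1

# A backslash continues a statement on the next line.
total = values[0] + values[1] + \
        values[2] + values[3]

print(count_large, total)
```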


In [None]:
# This program computes the factorial of 100.

fact = 1
n = 100
for factor in range(n,0,-1):
    fact = fact * factor 
print(fact)    
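As a cross-check, the loop can be wrapped in a function and compared with `math.factorial` from the standard library. The helper name `factorial_loop` is hypothetical and is introduced here only for the comparison.

```python
import math


def factorial_loop(n):
    """Multiply n * (n-1) * ... * 1 with the same loop as above."""
    fact = 1
    for factor in range(n, 0, -1):
        fact = fact * factor
    return fact


result = factorial_loop(100)

print(result == math.factorial(100))   # both methods agree
print(len(str(result)))                # number of digits of 100!
```

Python integers have arbitrary precision, so the result is exact even though it is far larger than what fits in a 64-bit integer.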

## Reading and handling data

In this course we are going to deal with a wide range of data types (text, numerical arrays, images, etc.).

When working with Google Colab, the first thing you need to do is grant access to the files in your Google Drive:


In [None]:
from google.colab import drive 
drive.mount('/content/gdrive')

Once you have access to your Google Drive, the files are available in the folder /content/gdrive. To list the folders in the root of your drive, use the listdir function from the os library:

In [None]:
import os

root_folder = '/content/gdrive/My Drive/'

for kdirs in os.listdir(root_folder):
    if os.path.isdir(os.path.join(root_folder, kdirs)):
        print(kdirs)

If you want to explore the contents of nested folders, use the function walk instead:

In [None]:
import os

root_folder = '/content/gdrive/MyDrive/data_course'

for root, dirs, files in os.walk(root_folder, topdown=False):
    for name in dirs:
        print(os.path.join(root, name))
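To try `os.listdir` and `os.walk` without a mounted drive, the sketch below builds a small temporary folder tree and then lists it. The folder and file names (`data`, `raw`, `notes.txt`) are hypothetical and exist only inside the temporary directory.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root_folder:
    # Build a tiny tree: data/raw/ and data/notes.txt
    os.makedirs(os.path.join(root_folder, 'data', 'raw'))
    with open(os.path.join(root_folder, 'data', 'notes.txt'), 'w') as f:
        f.write('example')

    # listdir only looks at the first level of the folder
    top_level = sorted(os.listdir(root_folder))

    # walk visits every subfolder and reports the files it contains
    found_files = []
    for root, dirs, files in os.walk(root_folder):
        for name in files:
            full_path = os.path.join(root, name)
            # keep the path relative to root_folder for readability
            found_files.append(os.path.relpath(full_path, root_folder))

print(top_level)
print(found_files)
```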

### Read excel with Pandas

In most cases, we will use the pandas library to read data. Download the shared folder DAPR sample data from

https://drive.google.com/drive/folders/1zKHhi3ZbhO8lUwtDAOo7qFZJlIVQjqL4?usp=sharing

and place it in the root of your google drive:

In [None]:
import pandas as pd
df = pd.read_excel('/content/gdrive/My Drive/DAPR sample data/test.xlsx', index_col=0,header=2)
df.head()

In [None]:
df['Humans']

In [None]:
df['Robots'].values
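The difference between selecting a column and taking its `.values` can be seen on a small table built by hand. The numbers and index labels below are invented for illustration; the real contents of `test.xlsx` may look different.

```python
import pandas as pd

# Hypothetical values, only the column names match the cells above.
df_demo = pd.DataFrame(
    {'Humans': [10, 20, 30], 'Robots': [1, 2, 3]},
    index=['2019', '2020', '2021'],
)

humans = df_demo['Humans']            # a pandas Series, the index is kept
robots = df_demo['Robots'].values     # a plain NumPy array, the index is dropped

print(type(humans).__name__)
print(type(robots).__name__)
print(df_demo.loc['2020', 'Robots'])  # one cell, selected by row and column label
```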

### Read data from web repository

We can also access data stored in a GitHub repository:

Load updated coronavirus data from https://github.com/datadista/datasets.git


In [None]:
!wget -O ccaa_covid19_fallecidos.csv 'https://raw.githubusercontent.com/datadista/datasets/master/COVID%2019/ccaa_covid19_fallecidos.csv'

The file has been downloaded to the folder /content.

In [None]:
import pandas as pd
d = pd.read_csv('/content/ccaa_covid19_fallecidos.csv',sep=',',index_col=None)
d = d.drop(columns='cod_ine')  # remove the region code column by name
d.head(20)
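If the download is not available, the same reading and column-dropping steps can be tried on a CSV held in memory. The miniature table below is entirely made up; only the column name `cod_ine` is taken from the cell above, and the other column names are assumptions.

```python
import io
import pandas as pd

# Hypothetical miniature CSV; the values are invented for illustration.
csv_text = """cod_ine,CCAA,2020-03-04,2020-03-05
01,Region A,0,1
02,Region B,2,3
"""

# read_csv accepts a file path or any file-like object such as StringIO
d_demo = pd.read_csv(io.StringIO(csv_text), sep=',', index_col=None)
d_demo = d_demo.drop(columns='cod_ine')   # remove the code column by name

print(list(d_demo.columns))
print(d_demo.shape)
```

Passing the column through the `columns=` keyword states explicitly that a column, not a row, is being removed.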


### Load built-in datasets

You can access built-in datasets included in many libraries.

Let's access, for instance, the mpg cars dataset included in the graphics library seaborn:

In [None]:
import pandas as pd 
import seaborn as sns 
mpg = sns.load_dataset("mpg")
mpg.head()

Scikit-learn is another common library that provides built-in datasets:

https://scikit-learn.org/stable/datasets/toy_dataset.html



In [None]:
from sklearn.datasets import load_iris
data = load_iris()
data.target[[10, 25, 50]]

list(data.target_names)

# 1. Exploratory data analysis in Python

# 2. Visualisation of multidimensional data

# 3. Data clustering


# 4. Supervised classification algorithms