
# Data Mining

## CRISP-DM

![CRISP-DM](https://www.datascience-pm.com/wp-content/uploads/2021/02/CRISP-DM.png)

1. Business understanding
2. Data understanding
3. Data preparation
4. Modeling
5. Evaluation
6. Deployment

## Business understanding
- Business objectives
- Assessment
- Data mining goals
- Project plan

## Data understanding
- Collect
- Describe
- Visualize
- Verify quality

## Data preparation
- Select
- Clean
- Integrate
- Format
- Standardize/Discretize/Transform
- Create new fields

## Modeling
- Select techniques
- Test design
- Build model
- Assess model

## Evaluation
- Evaluate results
- Review process
- Decide next step

## Data mining
- Learn from data
- Find common patterns
- Predict future results

### Characteristics
- Find patterns in data
- Gather insight from those patterns
- Use existing data to make intelligent decisions
- Structural descriptions (explicit patterns such as rules or decision trees) are especially useful, because they can be inspected and explained


### Common problems
- Data quality may be poor
- Data may be missing or noisy
- Patterns found may be inexact, not useful, or simply uninteresting

### Common data mining tasks
- Predictive
  - **Classification** Predict the class
  - **Regression** Predict a number
  - **Time Series Analysis** Predict next value in the series
- Descriptive
  - **Clustering** Grouping by similarities
  - **Summarization** Summarize and generalize data
  - **Association Rules** Discover connection between items
  - **Sequence Discovery** Find patterns in sequential data

## Data Cleaning
1. Import data
2. Merge datasets
3. Handle missing data
4. Standardization
5. Normalization
6. Remove duplicates

## Data Transformation
- **Transforming** Applying a function to an attribute
- **Summarizing** Grouping
- **Discretization**
  - Transform numeric values into categorical ones
  - Binarizing
  - Binning
- **Handle missing values**
  - Remove records
  - Imputation
    - Replace by a value
    - Use median/mean
    - Predict using kNN
- **Standardization** Substitute values by the *Z score*
- **Normalization** Rescale values to the **[0; 1]** interval
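
The transformations listed above can be sketched with plain NumPy. This is only an illustration on a made-up attribute (the `values` array, the threshold 30, and the bin edges 25 and 35 are assumptions, not part of any dataset used in these notes):

```python
import numpy as np

# Hypothetical sample attribute (e.g. ages); not taken from any dataset in these notes
values = np.array([20.0, 25.0, 30.0, 35.0, 40.0])

# Standardization: subtract the mean and divide by the standard deviation (Z score)
z_scores = (values - values.mean()) / values.std()

# Normalization: rescale linearly so the minimum maps to 0 and the maximum to 1
normalized = (values - values.min()) / (values.max() - values.min())

# Binarizing: 1 where the value exceeds a threshold, 0 otherwise
binarized = (values > 30).astype(int)

# Binning: index of the interval each value falls into, for the edges 25 and 35
binned = np.digitize(values, [25, 35])

print(z_scores, normalized, binarized, binned, sep="\n")
```

After standardization the values have mean 0 and standard deviation 1; after normalization they lie in [0; 1].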

## Data
A table contains:
- **Attributes** or columns of the dataset
- **Records** or rows of the dataset
- Each attribute represents a variable (e.g., salary, weight)
- Each record represents an example


| Name | Weight | Height | Sport |
|:----:|:------:|:------:|:-----:|
| John | 80 | 180 | Tennis |
| Mary | 50 | 160 | Rugby |
| Chris | 68 | 170 | Judo |

# Python refresher course
- lambda functions
- enumerate
- zip
- list and set comprehensions
 

In [None]:
f = lambda x: x ** 2
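
The other items in the list above can be illustrated in the same spirit. A minimal sketch, using made-up `names` and `ages` lists that exist only for this example:

```python
# Hypothetical sample data used only for this illustration
names = ["Ana", "Rui", "Eva"]
ages = [31, 25, 31]

# lambda: a small anonymous function, often passed as a sort key
by_age_then_name = sorted(zip(names, ages), key=lambda pair: (pair[1], pair[0]))

# enumerate: iterate with a running index (here starting at 1)
numbered = [f"{i}. {name}" for i, name in enumerate(names, start=1)]

# zip: walk several sequences in parallel, here to build a dictionary
age_of = dict(zip(names, ages))

# A list comprehension keeps order and duplicates; a set comprehension drops duplicates
squares = [n ** 2 for n in range(5)]
distinct_ages = {age for age in ages}

print(by_age_then_name, numbered, age_of, squares, distinct_ages, sep="\n")
```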

# Numpy
- [Documentation](https://numpy.org/doc/stable/user/index.html#user)
- [Tutorial 1](https://numpy.org/doc/stable/user/quickstart.html)
- [Tutorial 2](https://cs231n.github.io/python-numpy-tutorial/)

In [None]:
import numpy as np
a = np.array([1,2,3])
print(f"{a.size} {a.ndim} {a.shape}")
print(a[np.newaxis, :], a[:, np.newaxis], sep = '\n\n')

In [None]:
A = np.arange(1,10).reshape(3,3)
B = np.arange(12).reshape(3,4)
C = np.array([1,3,7,2,1,9,5,2,8]).reshape(3,3)
I = np.eye(3)
UM = np.ones((3,4))
ZERO = np.zeros((4,4))
D = np.array([9,8,5])
UM


In [None]:
# 3 rounds of rolling 20 dice
dados = np.random.multinomial(20, [1/6.]*6, size=3)
alturas = np.random.normal(180, 15, size = (3,6))
np.set_printoptions(precision=2)
print(alturas, alturas.T, alturas.ravel(), "", sep = "\n\n")


In [None]:
cA = A.copy()
cA[0:2, 1:3] = 3
print(A, cA, sep = "\n\n")

In [None]:
print(np.hstack((A,C)), np.vstack((A,C)), sep = "\n\n")
np.concatenate((A**2, B, dados), axis = 1)

In [None]:
Z = np.arange(12, 0, -1).reshape((3,4))
print(Z)
np.apply_along_axis(lambda x: np.sort(x), 0, Z)


In [None]:
np.fromfunction(lambda x, y: x + 10 * y, (3,2))

In [None]:
print(C, D, sep = "\n\n")
np.linalg.solve(C,D)

In [None]:
x = np.arange(1,11)
np.outer(x, x)

In [None]:
np.tile(np.array([1,2,3]), (4,1))

In [None]:
A + [1,0,1]


In [None]:
np.array([1,2,4]) @ np.array([2,1,3])

In [None]:
print(A, C, sep = "\n\n")
A * C

In [None]:
A @ C

In [None]:
A[(A>2) & (A < 7)]

In [None]:
B[0:2,1:3]


In [None]:
list(zip(*np.nonzero((A>2) & (A < 7))))


In [None]:
A[np.nonzero((A>2) & (A < 7))]


In [None]:
B[[0,2,1],[2,1,3]]


In [None]:
A[(A>2) & (A < 7)] += 10

print(f"""Matrix:
{A}
Column sums: {np.sum(A, axis = 0)}
Row sums: {np.sum(A, axis = 1)}
Cumulative sum down each column:
{np.cumsum(A, axis = 0)}
Cumulative sum along each row:
{np.cumsum(A, axis = 1)}
Total: {np.sum(A)}""")

np.cumsum(A)

In [None]:
import matplotlib.pyplot as plt
N = 8
y = np.zeros(N)
x1 = np.linspace(0, 10, N, endpoint=True)
x2 = np.linspace(0, 10, N, endpoint=False)
plt.plot(x1, y, 'o')
plt.plot(x2, y + 0.5, 'o')
plt.ylim([-0.5, 1])
plt.show()

In [None]:
import matplotlib.pyplot as plt
x = np.linspace(-np.pi, np.pi, 100)
plt.plot(x, np.sin(x), '--', x, np.cos(x), '-.')

Example of a least-squares polynomial fit

In [None]:
x = np.array([0.0, 1.0, 2.0, 3.0,  4.0,  5.0])
y = np.cos(x)

# Alternative sample values to try instead of cos(x):
# y = np.array([0.0, 0.8, 0.9, 0.1, -0.8, -1.0])
z = np.polyfit(x, y, 3)
p = np.poly1d(z)
p5 = np.poly1d(np.polyfit(x, y, 5))
xp = np.linspace(-2, 6, 100)
plt.plot(x, y, '.', xp, p(xp), '-', xp, p5(xp), '--')
plt.ylim(-2,2)
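
To see what "least squares" means here, the degree-3 fit can be reproduced by solving the linear system directly. A minimal sketch, assuming the same `x` and `y` as in the cell above; the names `V` and `coeffs` are introduced only for this illustration:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.cos(x)

# Vandermonde matrix for degree 3: the columns are x**3, x**2, x**1, x**0
V = np.vander(x, 4)

# Least-squares solution of V @ coeffs ≈ y
coeffs, residuals, rank, _ = np.linalg.lstsq(V, y, rcond=None)

# np.polyfit solves the same problem and also returns the highest power first
print(np.allclose(coeffs, np.polyfit(x, y, 3)))
```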

# Pandas

In [None]:
import pandas as pd

nome = "Joana Carlos João Filipa Susana".split()
altura = [156, 178, 183, 172, 178]
peso = [52,70,74,68,72]

pd.DataFrame(dict(nome = nome, altura = altura, peso = peso))

In [None]:
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
dataframe = pd.read_csv(url)
dataframe.head(10)

In [None]:
dataframe.describe()

In [None]:
dataframe = dataframe.drop('Name', axis=1)
dataframe[(dataframe['Sex'] == 'female') & (dataframe['Age'] >= 60)]

In [None]:
dataframe['Sex'].value_counts()

In [None]:
dataframe['Pclass'].value_counts()

In [None]:
dataframe['Pclass'].nunique()

# Examples involving grouping
These examples will:
1. Drop duplicates by sex, keeping the first row found for each sex
2. Count the rows in each group (survived or not)
3. Show the average age grouped by sex and survival

In [None]:
dataframe.drop_duplicates(subset=['Sex'])

In [None]:
dataframe.groupby('Survived').count()

In [None]:
dataframe.groupby('Survived')["PassengerId"].count()

In [None]:
dataframe.groupby(['Sex','Survived'])['Age'].mean()

# Handling missing values
The cells below:
- drop all rows containing missing values
- show rows where the age is not missing
- show rows where the age is missing

In [None]:
dataframe.head(2)

In [None]:
dataframe.dropna().head(2)

In [None]:
dataframe[dataframe["Age"].notnull()]

In [None]:
dataframe[dataframe['Age'].isnull()].head(2)

# Imputing missing values
- Imputing means substituting missing values
- In the first example, since some columns are categorical and others numerical, we use the *most frequent* strategy
- In the second example, we use the mean to fill in the missing values of the age

In [None]:
from sklearn.impute import SimpleImputer
# Fill each column's missing values with that column's most frequent value
imp = SimpleImputer(strategy="most_frequent")
df = pd.DataFrame(imp.fit_transform(dataframe),
                  columns=dataframe.columns, index=dataframe.index)
df

In [None]:
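# Fill missing ages with the mean of the observed ages
imp = SimpleImputer(strategy='mean')
df = dataframe.copy()
# fit_transform returns an (n, 1) array; ravel() flattens it to one value per row
df['Age'] = imp.fit_transform(dataframe[["Age"]]).ravel()
df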
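
For a single numeric column, the same kind of imputation can be sketched with pandas alone. The example below uses the median, which is less sensitive to extreme values than the mean; the `ages` series is made up for illustration and is not the Titanic data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing ages
ages = pd.Series([22.0, np.nan, 35.0, 58.0, np.nan], name="Age")

# Replace every missing value with the median of the observed values
filled = ages.fillna(ages.median())

print(filled.tolist())
```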

# One hot encoding
OHE means substituting a categorical attribute by several binary attributes, one for each value

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelBinarizer

dataframe = pd.DataFrame()
dataframe["Name"] = "Rui Joana Francisca Pedro".split()
dataframe["Age"] = [30,20,22,24]
dataframe["City"] = "Braga Porto Coimbra Braga".split()

city_enc = LabelBinarizer()
enc = city_enc.fit_transform(dataframe["City"])
enc_df = pd.DataFrame(enc, columns = city_enc.classes_)
new_df = pd.concat([dataframe, enc_df], axis = 1)#.drop("City", axis = 1)
new_df
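
pandas also offers `pd.get_dummies` as a shorter route to this kind of encoding. A minimal sketch, reusing the city values from the cell above (the `cities` and `encoded` names are only for this illustration):

```python
import pandas as pd

cities = pd.DataFrame({"City": ["Braga", "Porto", "Coimbra", "Braga"]})

# One 0/1 column per distinct city; dtype=int keeps the output as integers
encoded = pd.get_dummies(cities["City"], dtype=int)

print(encoded)
```

Each row has exactly one column set to 1, the one matching its city.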

# Encoding ordinal features
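Ordinal features have a natural order (e.g., low < medium < high), so instead of one-hot encoding them we map each category to a number that preserves that order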

In [None]:
df = pd.DataFrame([
        dict(name="Rui", age=21, score="high"),
        dict(name="Clara", age=20, score="medium"),
        dict(name="Pedro", age=19, score="low"),
        dict(name="Joana", age=19, score="high")])
# Map each category to an integer that preserves the order low < medium < high
df["score"] = df["score"].map(dict(low=0, medium=1, high=4))
df
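
An alternative sketch, assuming evenly spaced codes are acceptable, is to declare the order once with an ordered pandas `Categorical` and read off the integer codes. The `scores` series below simply repeats the score values from the cell above:

```python
import pandas as pd

scores = pd.Series(["high", "medium", "low", "high"])

# Declare the categories in their natural order, lowest first
ordered = pd.Categorical(scores, categories=["low", "medium", "high"], ordered=True)

# .codes gives each value's position in that order: low=0, medium=1, high=2
codes = pd.Series(ordered.codes)

print(codes.tolist())
```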

# Scaling
Example of scaling the values between 0 and 1

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
new_df["Age"] = scaler.fit_transform(new_df[["Age"]])
new_df

# Standardizing
We replace the **Idade** (age) column with its standardized value (Z score). After this, the column has mean 0 and standard deviation 1.

In [None]:
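from sklearn.preprocessing import StandardScaler
# A list of rows (rather than np.array) keeps Idade numeric instead of turning every value into a string
df = pd.DataFrame([["Rui", 45, "Braga"],
        ["Joana", 29, "Braga"],
        ["Carla", 29, "Porto"],
        ["Marcos", 27, "Lisboa"]], columns = "Nome Idade Cidade".split())
sc = StandardScaler()
# ravel() flattens the (n, 1) result to one value per row
df["Idade"] = sc.fit_transform(df[["Idade"]]).ravel()
df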
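
The Z score can also be computed by hand, which makes the formula explicit. A minimal sketch on the same four ages; note that `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas' `Series.std()` defaults to `ddof=1`, so the two give slightly different results unless `ddof` is set explicitly:

```python
import numpy as np

ages = np.array([45.0, 29.0, 29.0, 27.0])

# Z score using the population standard deviation (ddof=0)
z = (ages - ages.mean()) / ages.std(ddof=0)

print(z.round(2))
```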

# Applying functions

In [None]:
# Build from a list of rows so that Idade stays an integer column
df = pd.DataFrame([["Rui", 45, "Braga"],
        ["Joana", 29, "Braga"],
        ["Carla", 29, "Porto"],
        ["Marcos", 27, "Lisboa"]], columns = "Nome Idade Cidade".split())
df["Nome"] = df["Nome"].apply(lambda x: x.upper())
df["Idade"] = df["Idade"].apply(lambda x : x + 10)
df

In [None]:
df.groupby("Cidade").apply(lambda x: x.sum())

In [None]:
df.groupby("Cidade")["Idade"].apply(lambda x: x.mean())

# Detecting outliers by IQR

In [None]:
url="https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
wine = pd.read_csv(url, sep = ";")
def outliers(x):
    """Return a boolean mask marking values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - (iqr * 1.5)
    upper_bound = q3 + (iqr * 1.5)
    return (x < lower_bound) | (x > upper_bound)
wine[outliers(wine.iloc[:,0])].head()
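
The rule is easier to follow on a small array. A worked sketch on made-up numbers (the array `x` is hypothetical and unrelated to the wine data):

```python
import numpy as np

# Hypothetical sample with one obviously extreme value
x = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Boolean mask marking the values outside [lower, upper]
mask = (x < lower) | (x > upper)

print(q1, q3, lower, upper)
print(x[mask])
```

Here Q1 = 12 and Q3 = 13.75, so the bounds are 9.375 and 16.375 and only 102 is flagged.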

# Discretizing

In [None]:
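nwine = wine.copy()
# Binarize: 0 if alcohol < 11, 1 otherwise
nwine["alcoholStrong"] = np.digitize(wine["alcohol"], [11])
# Bin by quartile: codes 0 to 3, split at the 25th, 50th and 75th percentiles
nwine["alcoholD"] = np.digitize(wine["alcohol"], np.percentile(wine["alcohol"], [25, 50, 75]))
nwine[["alcohol", "alcoholD", "alcoholStrong"]]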
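
When readable labels are preferable to integer codes, `pd.cut` is one option. A minimal sketch on made-up alcohol percentages; the bin edges and the labels are assumptions chosen for illustration. Unlike `np.digitize`, whose bins are left-closed by default, `pd.cut` uses right-closed intervals by default:

```python
import pandas as pd

# Hypothetical alcohol percentages
alcohol = pd.Series([9.0, 10.5, 11.0, 12.8, 14.0])

# Intervals are right-closed by default: (8, 10], (10, 12], (12, 15]
level = pd.cut(alcohol, bins=[8, 10, 12, 15], labels=["low", "medium", "high"])

print(level.tolist())
```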

# Plotting

In [None]:
import seaborn as sns
import matplotlib
sns.set_theme(style="whitegrid")
cmap = sns.cubehelix_palette(rot=-.2, as_cmap=True)

iris = sns.load_dataset("iris")
sns.relplot(x="sepal_length", y = "sepal_width", hue = "petal_width", size = "petal_length", style = "species", palette=cmap, sizes=(10, 200), data  = iris)
g = sns.PairGrid(iris, hue = "species")
g.map_upper(sns.scatterplot)
g.map_lower(sns.kdeplot)
g.map_diag(sns.kdeplot, lw=3, legend=False)

In [None]:
sns.pairplot(iris, hue="species")

# Example of binarization

In [None]:
lenses = """young           myope                   no     reduced      none
young           myope                   no      normal      soft
young           myope                  yes     reduced      none
young           myope                  yes      normal      hard
young           hypermetrope            no     reduced      none
young           hypermetrope            no      normal      soft
young           hypermetrope           yes     reduced      none
young           hypermetrope           yes      normal      hard
pre-PB          myope                   no     reduced      none
pre-PB          myope                   no      normal      soft
pre-PB          myope                  yes     reduced      none
pre-PB          myope                  yes      normal      hard
pre-PB          hypermetrope            no     reduced      none
pre-PB          hypermetrope            no      normal      soft
pre-PB          hypermetrope           yes     reduced      none
pre-PB          hypermetrope           yes      normal      none
PB              myope                   no     reduced      none
PB              myope                   no      normal      none
PB              myope                  yes     reduced      none
PB              myope                  yes      normal      hard
PB              hypermetrope            no     reduced      none
PB              hypermetrope            no      normal      soft
PB              hypermetrope           yes     reduced      none
PB              hypermetrope           yes      normal      none"""
lenses = pd.DataFrame([line.split() for line in lenses.splitlines()], columns = "Age Prescription Astigmatism TearProdRate ContactLenses".split())

def add_one_hot(column):
    enc = LabelBinarizer()
    values = enc.fit_transform(column)  # fit once and reuse the result
    if len(enc.classes_) > 2:
        columns = enc.classes_
    else:
        # Binary attribute: a single 0/1 column in which 1 marks classes_[1]
        columns = [enc.classes_[1]]
    return pd.DataFrame(values, columns = columns)


df = pd.DataFrame()
for attr in "Age Prescription Astigmatism TearProdRate".split():
    df = pd.concat([df, add_one_hot(lenses[attr])], axis = 1)
df.describe()

# Tutorials

- [Numpy](https://numpy.org/doc/stable/user/quickstart.html)
- [Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html)
- [Seaborn](https://seaborn.pydata.org/tutorial.html)
- [scikit-learn](https://scikit-learn.org/stable/tutorial/index.html)
