In [None]:
# Remove warnings
import warnings
warnings.filterwarnings('ignore')

# Structured data

## Data as variables

In [None]:
# variables
customer1_age = 38
customer1_height = 178
customer1_loan = 34.23
customer1_name = 'Zajac'

> Why don't we use variables for data analysis?

In Python, whatever type of data we analyze and process, we can collect the values and store them in a `list`.

In [None]:
# Python lists - what can we put in a list?
customer = []
print(customer)

In [None]:
# different types in one object
type(customer)
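As a minimal sketch of what the empty list could hold, the block below fills it with the four customer values defined earlier. The ordering of the values (age, height, loan, name) is an assumption made for illustration.

```python
# A list can hold values of different types in a single object.
# Values are taken from the customer1_* variables; the ordering is an assumption.
customer = [38, 178, 34.23, "Zajac"]  # age, height, loan, name

# Every element keeps its own type; the list itself is just a container.
element_types = [type(item).__name__ for item in customer]
print(element_types)
```

The list accepts everything, which is convenient for collecting values but gives no guarantee that arithmetic on its elements makes sense.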

> Why aren't lists the best place to store data?

Let's take two numerical lists.

In [None]:
# two numerical lists
a = [1,2,3]
b = [4,5,6]

Typical operations on lists in data analysis

In [None]:
# add lists
print(f"a+b: {a+b}")
# .format() works as well
print("a+b: {}".format(a+b))

In [None]:
# multiplication
try:
    print(a*b)
except TypeError:
    print("undefined operation")
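The two cells above show that `+` concatenates lists and `*` between two lists is not defined. A small sketch of what element-wise arithmetic on plain lists would require, using the same `a` and `b`:

```python
a = [1, 2, 3]
b = [4, 5, 6]

# "+" concatenates lists rather than adding element by element.
concatenated = a + b

# Element-wise arithmetic has to be written out by hand, pair by pair.
elementwise_sum = [x + y for x, y in zip(a, b)]
elementwise_product = [x * y for x, y in zip(a, b)]

print(concatenated)
print(elementwise_sum)
print(elementwise_product)
```

This works, but every operation needs its own loop, which is one reason numerical work usually moves to arrays.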

In [None]:
import numpy as np
aa = np.array(a)
bb = np.array(b)

print(aa,bb)

In [None]:
# addition works element-wise
print(f"aa+bb: {aa+bb}")
# multiplication: element-wise product vs. dot product
try:
    print("="*50)
    print(aa*bb)
    print("aa*bb - is this correct ?")
    print(np.dot(aa,bb))
    print("np.dot - is this correct ?")
except TypeError:
    print("undefined operation")
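Both results in the cell above are valid; they answer different questions. A short sketch of the relationship, with the arrays redefined so the block runs on its own:

```python
import numpy as np

aa = np.array([1, 2, 3])
bb = np.array([4, 5, 6])

elementwise = aa * bb          # multiplies matching positions, returns an array
dot_product = np.dot(aa, bb)   # multiplies matching positions, then sums to a scalar

print(elementwise)
print(dot_product)
print(dot_product == elementwise.sum())
```

For 1-D arrays the dot product is the sum of the element-wise product, so which one is "correct" depends on whether a vector or a single number is needed.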

In [None]:
# array properties
x = np.array(range(4))
print(x)
x.shape

In [None]:
A = np.array([range(4),range(4)])
# transposition: row i becomes column i, column j becomes row j
A.T
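To make the effect of transposition concrete, the sketch below compares shapes and one pair of elements. The indices `i` and `j` are arbitrary choices for illustration.

```python
import numpy as np

A = np.array([range(4), range(4)])

# 2 rows x 4 columns becomes 4 rows x 2 columns.
print(A.shape, A.T.shape)

# Element (i, j) of A is element (j, i) of A.T; i and j are arbitrary here.
i, j = 1, 3
print(A[i, j] == A.T[j, i])
```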

In [None]:
# 0-dim object
scalar = np.array(5)
print(f"scalar object dim: {scalar.ndim}")
# 1-dim object
vector_1d = np.array([3, 5, 7])
print(f"vector object dim: {vector_1d.ndim}")
# 2-dim object: 2 rows (observations) with 3 features each
matrix_2d = np.array([[1,2,3],[3,4,5]])
print(f"matrix object dim: {matrix_2d.ndim}")

<img src="tensory.png">


[Sebastian Raschka Course](https://sebastianraschka.com/blog/2020/numpy-intro.html)


## PyTorch 

[PyTorch](https://pytorch.org) is an open-source Python-based deep learning library. 
PyTorch has been the most widely used deep learning library for research since 2019 by a wide margin. In short, for many practitioners and researchers, PyTorch offers just the right balance between usability and features.

1. PyTorch is a tensor library that extends the array-oriented programming model of NumPy with accelerated computation on GPUs, providing a seamless switch between CPUs and GPUs.

2. PyTorch is an automatic differentiation engine, also known as autograd, which enables the automatic computation of gradients for tensor operations, simplifying backpropagation and model optimization.

3. PyTorch is a deep learning library, meaning that it offers modular, flexible, and efficient building blocks (including pre-trained models, loss functions, and optimizers) for designing and training a wide range of deep learning models, catering to both researchers and developers.
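Points 1 and 2 can be sketched in a few lines: pick a device, then let autograd compute a derivative. The function `y = x**2 + 2*x` and the input value 3.0 are assumptions chosen only so the result is easy to check by hand.

```python
import torch

# Point 1: the same code runs on CPU or GPU, depending on what is available.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Point 2: mark the input so autograd tracks operations on it.
x = torch.tensor(3.0, requires_grad=True, device=device)

# y = x^2 + 2x, so analytically dy/dx = 2x + 2.
y = x**2 + 2 * x
y.backward()

print(x.grad)
```

At `x = 3` the analytical derivative is `2*3 + 2 = 8`, which is what `x.grad` holds after `backward()`.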


In [None]:
import torch

In [None]:
torch.cuda.is_available()

In [None]:
tensor0d = torch.tensor(1) 
tensor1d = torch.tensor([1, 2, 3])
tensor2d = torch.tensor([[1, 2, 2], [3, 4, 5]])
tensor3d = torch.tensor([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

In [None]:
print(tensor1d.dtype)

In [None]:
torch.tensor([1.0, 2.0, 3.0]).dtype

In [None]:
tensor2d

In [None]:
tensor2d.shape

In [None]:
print(tensor2d.reshape(3, 2))

In [None]:
print(tensor2d.T)

In [None]:
print(tensor2d.matmul(tensor2d.T))

In [None]:
print(tensor2d @ tensor2d.T)

More information is available in the [PyTorch tensor documentation](https://pytorch.org/docs/stable/tensors.html).

## Data Modeling
Let's take one input variable (`xs`) and one target variable (`ys`).
```python
xs = np.array([-1,0,1,2,3,4])
ys = np.array([-3,-1,1,3,5,7])
```

What kind of model can we use?

In [None]:
# Linear regression

import numpy as np
from sklearn.linear_model import LinearRegression

xs = np.array([-1,0,1,2,3,4])
# or rather, as a 2-D column: scikit-learn expects shape (n_samples, n_features)
xs = xs.reshape(-1, 1)

ys = np.array([-3, -1, 1, 3, 5, 7])

reg = LinearRegression()
model = reg.fit(xs,ys)

print(f"solution: x1={model.coef_[0]}, x0={reg.intercept_}")

model.predict(np.array([[1],[5]]))

The simple code fully accomplishes our task of finding a linear regression model.
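As a cross-check, the same line can be recovered with plain NumPy. This is a sketch using `np.polyfit` with degree 1, a least-squares fit that is swapped in here for comparison; the names `slope` and `intercept` are introduced for illustration.

```python
import numpy as np

xs = np.array([-1, 0, 1, 2, 3, 4])
ys = np.array([-3, -1, 1, 3, 5, 7])

# Degree-1 least-squares fit; coefficients come back highest power first.
slope, intercept = np.polyfit(xs, ys, deg=1)

print(slope, intercept)
```

The data lie exactly on `y = 2x - 1`, so both approaches should agree up to floating-point error.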

What can we use such a generated model for?

To reuse it elsewhere, we need to save it to a file.

In [None]:
# save model
import pickle
with open('model.pkl', "wb") as picklefile:
    pickle.dump(model, picklefile)

Now we can load it (for example, from a GitHub repository) and use it in other projects.

In [None]:
# load model
with open('model.pkl',"rb") as picklefile:
    mreg = pickle.load(picklefile)

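But remember the Python environment: a pickled model should be loaded with the same library versions (scikit-learn in particular) that were used to save it.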

In [None]:
mreg.predict(xs)
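One way to act on the environment reminder is to store the library version next to the model and compare it on load. This is only a sketch: the dictionary layout, the key names, and the file name `model_bundle.pkl` are hypothetical conventions, not part of scikit-learn or pickle.

```python
import os
import pickle
import tempfile

import numpy as np
import sklearn
from sklearn.linear_model import LinearRegression

xs = np.array([-1, 0, 1, 2, 3, 4]).reshape(-1, 1)
ys = np.array([-3, -1, 1, 3, 5, 7])
model = LinearRegression().fit(xs, ys)

# Hypothetical convention: bundle the model with the version that produced it.
bundle = {"model": model, "sklearn_version": sklearn.__version__}

# Written to a temporary directory so the original model.pkl is untouched.
path = os.path.join(tempfile.mkdtemp(), "model_bundle.pkl")
with open(path, "wb") as picklefile:
    pickle.dump(bundle, picklefile)

with open(path, "rb") as picklefile:
    loaded = pickle.load(picklefile)

# Warn instead of failing silently when the environments differ.
if loaded["sklearn_version"] != sklearn.__version__:
    print("Warning: model was saved with a different scikit-learn version")

prediction = loaded["model"].predict(np.array([[5]]))
print(prediction)
```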

## Neural Networks

In [None]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

In [None]:
import tensorflow as tf

We can also look at this problem from a different perspective.
Neural networks are also capable of solving regression problems.

In [None]:
layer_0 = Dense(units=1, input_shape=[1])

model = Sequential([layer_0])

# compile and fit
model.compile(optimizer='sgd', loss='mean_squared_error')
model.fit(xs, ys, epochs=10)

In [None]:
print(f"{layer_0.get_weights()}")
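What a single `Dense` unit learns here is one weight and one bias. The sketch below reproduces that idea in plain NumPy with full-batch gradient descent on the mean squared error. It is not Keras's exact procedure: the zero starting point, the learning rate, and the number of epochs are assumptions chosen for illustration.

```python
import numpy as np

xs = np.array([-1, 0, 1, 2, 3, 4], dtype=float)
ys = np.array([-3, -1, 1, 3, 5, 7], dtype=float)

w, b = 0.0, 0.0        # assumed starting point (Keras initializes the weight randomly)
learning_rate = 0.05   # assumed value
epochs = 500           # assumed value

for _ in range(epochs):
    error = (w * xs + b) - ys
    # Gradients of the mean squared error with respect to w and b.
    grad_w = 2 * np.mean(error * xs)
    grad_b = 2 * np.mean(error)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(round(float(w), 3), round(float(b), 3))
```

With enough updates the parameters approach the line `y = 2x - 1`. The Keras cell above trains for only 10 epochs, so its printed weights will probably still be some distance from those values.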

## Other ways of acquiring data

1. Ready-made sources in Python libraries.
2. Data from external files (e.g., CSV, JSON, TXT) from a local disk or the internet.
3. Data from databases (e.g., MySQL, PostgreSQL, MongoDB).
4. Data generated artificially for a chosen modeling problem.
5. Data streams.
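Item 2 can be sketched without touching the disk by reading CSV and JSON from in-memory text. The names and ages below are hypothetical sample data.

```python
import io
import json

import pandas as pd

# Hypothetical CSV content; pd.read_csv accepts any file-like object.
csv_text = "name,age\nAnna,31\nJan,45\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# Hypothetical JSON content: a list of records becomes one row per record.
json_text = '[{"name": "Anna", "age": 31}, {"name": "Jan", "age": 45}]'
df_json = pd.DataFrame(json.loads(json_text))

print(df_csv)
print(df_json)
```

For a real file or URL, the `io.StringIO(...)` argument would be replaced by the path or address.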

In [None]:
from sklearn.datasets import load_iris

iris = load_iris()

In [None]:
# find all keys
iris.keys()

In [None]:
# print description
print(iris.DESCR)

In [None]:
import pandas as pd
import numpy as np

# create DataFrame
df = pd.DataFrame(data= np.c_[iris['data'], iris['target']],
                  columns= iris['feature_names'] + ['target'])

In [None]:
# show the last 10 rows
df.tail(10)

In [None]:
# show info about NaN values and a type of each column.
df.info()

In [None]:
# statistics
df.describe()

In [None]:
# new features
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

In [None]:
# remove features (columns)
df = df.drop(columns=['target'])

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="whitegrid", palette="husl")

iris_melt = pd.melt(df, "species", var_name="measurement")
f, ax = plt.subplots(1, figsize=(15,9))
sns.stripplot(x="measurement", y="value", hue="species", data=iris_melt, jitter=True, edgecolor="white", ax=ax)

In [None]:
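# first 100 rows (setosa and versicolor); columns 0 and 2 are sepal length and petal length
X = df.iloc[:100,[0,2]].values
# column 4 is `species` (after dropping `target`)
y = df.iloc[0:100,4].values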

In [None]:
y = np.where(y == 'setosa',-1,1)

In [None]:
plt.scatter(X[:50,0],X[:50,1],color='red', marker='o',label='setosa')
plt.scatter(X[50:100,0],X[50:100,1],color='blue', marker='x',label='versicolor')
plt.xlabel('sepal length (cm)')
plt.ylabel('petal length (cm)')
plt.legend(loc='upper left')
plt.show()

For linearly separable data like this, a logistic regression model or a simple neural network (such as a perceptron) is sufficient.

In [None]:
from sklearn.linear_model import Perceptron

per_clf = Perceptron()
per_clf.fit(X,y)

y_pred = per_clf.predict([[2, 0.5],[4,5.5]])
y_pred
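The classic perceptron learning rule can be sketched in plain NumPy. This is an illustration of the idea, not scikit-learn's exact implementation, and the four data points are hypothetical toy values rather than iris measurements.

```python
import numpy as np

# Hypothetical, linearly separable toy data with labels -1 and 1.
X_toy = np.array([[1.0, 1.0], [2.0, 1.0], [4.0, 5.0], [5.0, 4.0]])
y_toy = np.array([-1, -1, 1, 1])

w = np.zeros(X_toy.shape[1])
b = 0.0

for epoch in range(1000):  # generous cap; the loop stops once an epoch has no mistakes
    mistakes = 0
    for xi, yi in zip(X_toy, y_toy):
        # A point is wrong (or on the boundary) when label and score disagree in sign.
        if yi * (np.dot(w, xi) + b) <= 0:
            w += yi * xi   # move the boundary toward the correct side
            b += yi
            mistakes += 1
    if mistakes == 0:
        break

predictions = np.where(X_toy @ w + b > 0, 1, -1)
print(predictions)
```

Because the toy data are linearly separable, the rule is guaranteed to stop making mistakes after a finite number of updates.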

## Data Storage and Connection to a Simple SQL Database

In [None]:
IRIS_PATH = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
col_names = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]
df = pd.read_csv(IRIS_PATH, names=col_names)

In [None]:
# save to sqlite
import sqlite3
# generate database
conn = sqlite3.connect("iris.db")
# pandas to_sql

try:
    df.to_sql("iris", conn, index=False)
except ValueError:
    # to_sql raises ValueError when the table already exists
    print("table already exists")

In [None]:
# sql to pandas
result = pd.read_sql("SELECT * FROM iris WHERE sepal_length > 5", conn)

In [None]:
result.head(3)
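Two small variations on the cells above, sketched on an in-memory database: `if_exists="replace"` avoids the error on repeated runs, and a `?` placeholder passes the threshold as a parameter instead of writing it into the SQL string. The three-row DataFrame is a hypothetical stand-in for the iris table.

```python
import sqlite3

import pandas as pd

# Hypothetical three-row sample standing in for the iris table.
sample = pd.DataFrame({
    "sepal_length": [4.9, 5.1, 6.3],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-virginica"],
})

conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk

# if_exists="replace" overwrites the table instead of raising when it exists.
sample.to_sql("iris", conn, index=False, if_exists="replace")
sample.to_sql("iris", conn, index=False, if_exists="replace")  # safe to repeat

# "?" is sqlite3's placeholder; the value is supplied separately through params.
result = pd.read_sql("SELECT * FROM iris WHERE sepal_length > ?", conn, params=(5,))
print(result)

conn.close()
```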

In [None]:
# Artificial data
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

X, y = datasets.make_classification(n_samples=10**4,
                                    n_features=20, n_informative=2, n_redundant=2)

# train/test split by hand
train_samples = 7000  # 70%

X_train = X[:train_samples]
X_test = X[train_samples:]
y_train = y[:train_samples]
y_test = y[train_samples:]

rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)

In [None]:
rfc.predict(X_train[0].reshape(1, -1))
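The prediction above is made on a training row, which says little about how the model generalizes. A sketch of the same workflow with `train_test_split` and an accuracy score on the held-out part follows; the smaller sample size, `n_estimators=50`, and the fixed `random_state` values are assumptions chosen to keep the example fast and repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Smaller artificial data set than above; sizes and seeds are assumptions.
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=2, n_redundant=2, random_state=0)

# 70% training, 30% test, shuffled before splitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

rfc = RandomForestClassifier(n_estimators=50, random_state=0)
rfc.fit(X_train, y_train)

# Accuracy on rows the model has not seen during training.
test_accuracy = accuracy_score(y_test, rfc.predict(X_test))
print(f"test accuracy: {test_accuracy:.2f}")
```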

## Exercises

1. Load data from the `train.csv` file into a pandas DataFrame.

In [None]:
## YOUR CODE HERE
df = ...

2. Show the number of rows and the number of columns.

In [None]:
## YOUR CODE HERE


3. Handle missing data:

   - Option 1: remove rows containing missing data (`dropna()`)
   - Option 2: remove columns containing missing data (`drop()`)
   - Option 3: impute missing values with the column mean (`fillna()`)

Which columns did you choose for each option and why?
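The three options behave quite differently, which a toy frame makes visible. The column names and values below are hypothetical and are not taken from `train.csv`; the sketch only shows the mechanics, not which columns to choose.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in two columns.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 41.0],
    "cabin": [np.nan, np.nan, "C85", np.nan],
    "fare": [7.25, 71.28, 8.05, 53.10],
})

print(toy.isna().sum())  # missing values per column

rows_dropped = toy.dropna()                        # Option 1: keep only complete rows
cols_dropped = toy.drop(columns=["cabin"])         # Option 2: drop a mostly empty column
imputed = toy.fillna({"age": toy["age"].mean()})   # Option 3: fill gaps in "age" with its mean

print(len(toy), len(rows_dropped))
print(imputed["age"].tolist())
```

Dropping rows is costly when one column is mostly empty, which is why the choice usually depends on how many values each column is missing.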

In [None]:
## YOUR CODE HERE


4. Using the `nunique()` method, remove columns that are not suitable for modeling.

In [None]:
## YOUR CODE HERE


5. Convert categorical variables using LabelEncoder into numerical form.

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
## YOUR CODE HERE

6. Use `MinMaxScaler` to transform floating-point data to a common scale.

In [None]:
from sklearn.preprocessing import MinMaxScaler

## YOUR CODE HERE


7. Split the data into training set (80%) and test set (20%)

In [None]:
from sklearn.model_selection import train_test_split
## YOUR CODE HERE
X_train, X_test, y_train, y_test = train_test_split(...., random_state=44)

8. Each passenger can be classified by mapping a classifier over the rows. The `run()` function takes a classifier that handles a single case and applies it to every row.
   - Write a classifier that assigns a value of 0 or 1 randomly (you can use the `random.randint(0,1)` function).
   - Execute the `evaluate()` function and check how well the random classifier performs.

In [None]:
classify = ...

In [None]:
def run(f_classify, x):
    return list(map(f_classify, x))

def evaluate(predictions, actual):
    correct = list(filter(
        lambda item: item[0] == item[1],
        list(zip(predictions, actual))
    ))
    return f"{len(correct)} correct answers from {len(actual)}. Accuracy ({len(correct)/len(actual)*100:.0f}%)"

In [None]:
evaluate(run(classify, X_train.values), y_train.values)