<a href="https://colab.research.google.com/github/gitmystuff/DTSC4050/blob/main/Week_01-Introduction/Week_01_Python_Workbook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 01 - Python Workbook

In [None]:
# code example


# Python

* **Interpreted**: Python code is run by an interpreter, a program that directly executes instructions written in a programming or scripting language without requiring them to be compiled into a machine-language program first. (Compiled languages include Go and C++.)
* **Object Oriented**: a programming paradigm based on the concept of "objects", which can contain data and code: data in the form of fields, and code, in the form of procedures. A common feature of objects is that procedures are attached to them and can access and modify the object's data fields.
* **Object**: An object is simply a collection of data (variables) and methods (functions) that act on those data. Similarly, a class is a blueprint for that object. We can think of a class as a sketch (prototype) of a house. It contains all the details about the floors, doors, windows, etc.
* **High-Level Language**: A high-level language (HLL) lets a programmer write programs that are largely independent of a particular type of computer. Such languages are considered high-level because they are closer to human languages and further from machine languages. Examples include Python, C, C#, C++, PHP, Java, FORTRAN, and Pascal, in contrast to assembly language and machine code.
* **Dynamic Semantics**: Objects are created and bound to names at run time, and a name can be rebound to a value of a different type, unlike in a statically typed language. For example, if we set `a = 2` and then `a = 'hello'`, the name `a` refers to the string instead of the integer as soon as the second line is executed.
* **Built-In Data Structures**: Organizing, managing, and storing data is important because it enables easier access and efficient modification. Data structures let you store collections of data, relate them, and perform operations on them. Python has built-in support for several data structures, including the list, dictionary, tuple, and set.
* **Dynamic Typing**: Dynamic typing means that types are determined at run time. The type belongs to the value a variable currently holds, not to the variable itself. Programs written in dynamically typed languages are more flexible, but type errors are not caught ahead of time: a program will start running even if it contains one, and the error surfaces only when the offending line executes.
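
The dynamic typing point can be illustrated with a minimal sketch. The helper `add_one` is a hypothetical name introduced only for this example; the idea is that the type mismatch is reported when the call runs, not before.

```python
# The same name can be rebound to values of different types.
a = 2
type_before = type(a).__name__
a = 'hello'
type_after = type(a).__name__


def add_one(x):
    """Hypothetical helper: works for numbers, fails for strings."""
    return x + 1


# Defining add_one raised no error; the problem appears only when it is called.
try:
    add_one(a)
    error_message = None
except TypeError as err:
    error_message = str(err)

print(type_before, type_after)
print(error_message)
```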


In [None]:
# dynamic typing example


In [None]:
# low value integers are pre-allocated


### Data Structures

#### Primitive Types

* Integers
* Floats
* Strings
* Booleans

#### Non-Primitive Types (Collections)

* Lists
* Tuples
* Sets
* Dictionaries

## Python Core

### Numbers

Integral (whole numbers)

* Integers
* Booleans

Non-Integral

* Floats
* Complex
* Decimals
* Fractions
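
As a rough sketch of the number types listed above (the variable names are chosen for illustration only), the following shows how each one behaves in simple arithmetic:

```python
from decimal import Decimal
from fractions import Fraction

whole = 7                      # int
flag = True                    # bool is a subclass of int, so True behaves like 1
float_sum = 0.1 + 0.2          # binary floating point: not exactly 0.3
z = complex(1, 2)              # complex number 1 + 2j
decimal_sum = Decimal('0.1') + Decimal('0.2')   # exact decimal arithmetic
fraction_sum = Fraction(1, 3) + Fraction(1, 6)  # exact rational arithmetic

print(isinstance(flag, int), flag + whole)
print(float_sum == 0.3)
print(decimal_sum == Decimal('0.3'))
print(fraction_sum)
print(z * z)
```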

In [None]:
# numbers


In [None]:
# https://docs.python.org/3/library/fractions.html
from fractions import Fraction


## Lists, Dictionaries, Tuples, and Sets

### Collections

Sequences

* Mutable: Lists
* Immutable: Tuples and Strings

Sets

* Mutable: Sets
* Immutable: Frozen Sets

Mappings

* Dictionaries
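
The mutable and immutable groupings above can be checked with a short sketch. The names used here (`kart_list`, `kart_tuple`, and so on) are illustrative assumptions, not part of the workbook cells that follow.

```python
# Lists are mutable sequences: items can be replaced in place.
kart_list = ['Toadette', 'Yoshi']
kart_list[0] = 'Birdo'

# Tuples are immutable sequences: item assignment raises TypeError.
kart_tuple = ('Toadette', 'Yoshi')
try:
    kart_tuple[0] = 'Birdo'
    tuple_was_changed = True
except TypeError:
    tuple_was_changed = False

# Sets hold unique items; members can be added or removed.
kart_set = {'Toadette', 'Yoshi', 'Yoshi'}
kart_set.add('Bowser')
kart_set.discard('Yoshi')

# A frozenset is the immutable counterpart and has no add() method.
frozen_kart = frozenset(kart_set)
frozen_can_add = hasattr(frozen_kart, 'add')

# Dictionaries map keys to values; values can be changed or added by key.
kart_dict = {'driver': 'Toadette'}
kart_dict['driver'] = 'Yoshi'
kart_dict['kart'] = 'Standard'

print(kart_list, tuple_was_changed, sorted(kart_set), frozen_can_add, kart_dict)
```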

In [None]:
# list


In [None]:
# multiple lines


In [None]:
# mixed datatypes


In [None]:
# mutable list


In [None]:
# tuple


In [None]:
# immutable tuple


In [None]:
# immutable string


In [None]:
# set


In [None]:
# add and remove from set but can't change a value


In [None]:
# dictionary


In [None]:
# change values by key


In [None]:
# adding to a dictionary


### Callables

* User-Defined Functions
* Classes
* Built-in Functions (e.g. len(), abs(), range(), etc.)
* Built-in Methods (e.g. my_list.append(x), my_list.extend(other_list), etc.)
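
A small sketch of the difference between a built-in function and a built-in method, using the built-in `callable()` to confirm what can be called. The list `names` is a made-up example.

```python
names = ['Yoshi', 'Toadette']

# Built-in function: called with the object as an argument.
count_before = len(names)

# Built-in methods: called on the object itself.
names.append('Birdo')
names.extend(['Bowser', 'Isabelle'])
count_after = len(names)

# callable() reports whether something can be called with parentheses.
checks = {
    'len': callable(len),
    'names.append': callable(names.append),
    'str (a class)': callable(str),
    'names (a list)': callable(names),
}

print(count_before, count_after)
print(checks)
```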

In [None]:
# user defined function
def my_funct(name):
    return f'Hello {name}'

print(my_funct('Toadette'))

In [None]:
# class
class InfoKart:

    def __init__(self, name1, name2):
        self.name1 = name1
        self.name2 = name2

    def on_your_mark(self):
        return f'Drivers! {self.name1} and {self.name2}. On your mark...'

race = InfoKart('Toadette', 'Yoshi')
print(race.on_your_mark())

In [None]:
# built-in function


In [None]:
# built-in methods


## Len and Range

* `len` is short for length: `len(x)` returns the number of items in `x`
* `range(start, stop, step)` produces a sequence of numbers from `start` up to, but not including, `stop`, in increments of `step`
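
A minimal sketch of `len` and `range`, assuming the same kind of list used elsewhere in this workbook. Wrapping `range` in `list()` is only there to make the generated numbers visible.

```python
my_kart = ['Baby Daisy', 'Baby Luigi', 'Baby Mario']

kart_count = len(my_kart)               # number of items in the list
first_five = list(range(5))             # start defaults to 0, stop is excluded
stepped = list(range(2, 10, 3))         # start=2, stop=10, step=3
countdown = list(range(5, 0, -1))       # a negative step counts down

# range(len(...)) yields every valid index of the list.
by_index = [my_kart[i] for i in range(len(my_kart))]

print(kart_count)
print(first_five)
print(stepped)
print(countdown)
print(by_index)
```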

In [None]:
# len


## For and Comprehensions

In [None]:
# range


In [None]:
# range(start, stop, step)


In [None]:
# reverse order


In [None]:
# list comprehension


In [None]:
# dictionary comprehension


In [None]:
# for loop list
my_kart = ['Baby Daisy', 'Baby Luigi', 'Baby Mario']


In [None]:
# nested for loop

my_kart = [['Baby Daisy', 'Baby Luigi', 'Baby Mario'], ['Birdo', 'Bowser', 'Donkey Kong'], ['Princess Peach', 'Isabelle', 'Koopa Troopa']]


In [None]:
# nested comprehension


In [None]:
# enumerate


## If Elif Else

In [None]:
# if else
my_kart = ['Baby Daisy', 'Baby Luigi', 'Baby Mario']
for i, name in enumerate(my_kart):
    if name == 'Baby Luigi':
        print(i, f'{name} is here')
    else:
        print('Baby Luigi\'s not at this index')

In [None]:
# if elif else
my_kart = ['Baby Daisy', 'Baby Luigi', 'Baby Mario']
for i, name in enumerate(my_kart):
    if name == 'Baby Daisy':
        print(i, f'{name} is here')
    elif name == 'Baby Luigi':
        print(i, f'{name} is here')
    else:
        print(i, 'Baby Mario could be here')

## Errors

https://docs.python.org/3/library/exceptions.html
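
As an illustration of a few of the built-in exceptions documented at the link above, the sketch below triggers several common ones and records their class names. The `actions` tuple is a hypothetical construct used only to keep the example compact.

```python
# Each lambda deliberately performs an operation that raises an exception.
actions = (
    lambda: 10 / 0,          # division by zero
    lambda: 10 + '1',        # adding a number and a string
    lambda: [1, 2, 3][5],    # index outside the list
    lambda: {'a': 1}['b'],   # missing dictionary key
    lambda: int('ten'),      # text that is not a valid integer
)

caught = []
for action in actions:
    try:
        action()
    except Exception as err:
        caught.append(type(err).__name__)

# Exceptions form a class hierarchy, so a parent class also matches.
zero_division_is_arithmetic = issubclass(ZeroDivisionError, ArithmeticError)

print(caught)
print(zero_division_is_arithmetic)
```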

In [None]:
# zero division error


In [None]:
# type error


## Try Except Break Continue

In [None]:
# zero division exception
a = 10
b = 0
try:
    a / b
except ZeroDivisionError:
    print('Ooops! Division by 0 not allowed.')

In [None]:
# type error and finally
a = 10
b = '1'
try:
    a + b
except TypeError:
    print('Ooops! Adding a string and number causes problems')
finally:
    print('But, I can still do things')

In [None]:
# operators https://www.tutorialspoint.com/python/python_basic_operators.htm
a = 0
b = 3
while a < 3:
    a += 1
    b -= 1
    print(a, b)

print('All Done')

In [None]:
# break
a = 0
b = 3
while a < 4:
    a += 1
    b -= 1
    print(a, b)
    try:
        a / b
    except ZeroDivisionError:
        print('Ooops!')
        break
    print('Still in while loop')

print('All Done')

In [None]:
# continue
a = 0
b = 3
while a < 4:
    a += 1
    b -= 1
    print(a, b)
    try:
        a / b
    except ZeroDivisionError:
        print('Ooops!')
        continue
    print('Still in while loop')

print('All Done')

# Numpy

* Scalars
* Vectors
* Matrices
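
A brief sketch of how scalars, vectors, and matrices map onto NumPy arrays of 0, 1, and 2 dimensions. The array contents are arbitrary example values.

```python
import numpy as np

scalar = np.array(5)                      # 0-dimensional array
vector = np.array([1, 2, 3])              # 1-dimensional array
matrix = np.arange(1, 10).reshape(3, 3)   # 2-dimensional array holding 1..9

scaled = 2 * vector            # scalar multiplication
elementwise = vector * vector  # element-wise multiplication
dot = vector @ vector          # dot product of two vectors
mat_vec = matrix @ vector      # matrix-vector product

print(scalar.ndim, vector.ndim, matrix.ndim)
print(scaled, elementwise, dot)
print(mat_vec)
```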

In [None]:
# list to array
import numpy as np

my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(type(my_list))
my_array = np.array(my_list)
print(type(my_array)) # n-dimensional array

In [None]:
# view list


In [None]:
# view array


In [None]:
# 3 x 3 array
my_kart = [['Baby Daisy', 'Baby Luigi', 'Baby Mario'], ['Birdo', 'Bowser', 'Donkey Kong'], ['Princess Peach', 'Isabelle', 'Koopa Troopa']]


In [None]:
# np.zeros


In [None]:
# arrays from arange and reshape
import numpy as np


In [None]:
# scalar multiplication


In [None]:
# scalar division


In [None]:
# element wise multiplication


In [None]:
# dot product


In [None]:
# https://nbviewer.org/github/jmportilla/Udemy-notes/blob/master/Lec%209%20-Indexing%20Arrays.ipynb


In [None]:
# universal functions
import numpy as np  # np.random belongs to NumPy; the standard-library random module is not needed here

np.random.randint(10, size=9).reshape(3, 3)

In [None]:
# some descriptive statistics
import numpy as np

stats_array = np.random.randint(10, size=9).reshape(3, 3)
print(stats_array)
print(stats_array.sum())
print(stats_array.mean())
print(stats_array.var())
print(stats_array.std())
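
One detail worth knowing about `var()` and `std()`: by default NumPy divides by `n` (the population formulas). Passing `ddof=1` divides by `n - 1` instead (the sample formulas). A small sketch with hand-picked values:

```python
import numpy as np

values = np.array([2, 4, 4, 4, 5, 5, 7, 9])   # mean is 5

population_var = values.var()        # divides by n
population_std = values.std()
sample_var = values.var(ddof=1)      # divides by n - 1
sample_std = values.std(ddof=1)

print(values.mean())
print(population_var, population_std)
print(sample_var, sample_std)
```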

# Pandas

In [None]:
# pandas read_csv
import pandas as pd

df = pd.read_csv('iris.csv')


In [None]:
# get a column by name
df['sepal_width'].head()

In [None]:
# get two columns by column name
df[['sepal_width', 'petal_length']].head()

In [None]:
# get two columns by position
df[df.columns[1:3]].head()

In [None]:
# The loc[] indexer uses row index labels and column names
df.loc[1:3, ['sepal_width', 'petal_length']].head()

In [None]:
# select rows 1 through 3 (loc slices include the end label) and a start:stop range of columns
df.loc[1:3, 'sepal_width': 'petal_width'].head()

In [None]:
# select 3rd row
df.loc[2, :]

In [None]:
# iloc uses index and column numbers
df.iloc[1:3, 1:3]

In [None]:
# using just an index
df.iloc[1]

In [None]:
# using at
df.at[1, 'sepal_width']

In [None]:
# using iat
df.iat[1, 1]
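
The difference between label-based and position-based selection can be sketched on a small stand-in DataFrame. The values below are made up for illustration and are not read from `iris.csv`; the name `small_df` is also an assumption.

```python
import pandas as pd

# Hypothetical stand-in data with the same column names as the iris file.
small_df = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 4.7, 4.6, 5.0],
    'sepal_width': [3.5, 3.0, 3.2, 3.1, 3.6],
    'petal_length': [1.4, 1.4, 1.3, 1.5, 1.4],
})

# loc slices by label and includes the end label: rows 1, 2, and 3.
by_label = small_df.loc[1:3, ['sepal_width', 'petal_length']]

# iloc slices by position and excludes the end position: rows 1 and 2.
by_position = small_df.iloc[1:3, 1:3]

# at and iat fetch a single value by label and by position.
single_by_label = small_df.at[1, 'sepal_width']
single_by_position = small_df.iat[1, 1]

print(by_label.shape, by_position.shape)
print(single_by_label, single_by_position)
```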

In [None]:
# filter rows by a single condition
df[df['sepal_width'] < 3].head()

In [None]:
# how much did we filter
print(df.shape)
print(df[df['sepal_width'] < 3].shape)

In [None]:
# multiple columns using or
df[(df['sepal_width'] < 3) | (df['petal_length'] > 4)].shape

In [None]:
# multiple columns using and
df[(df['sepal_width'] < 3) & (df['petal_length'] > 4)].shape

In [None]:
# query
df.query('`sepal_width` < 3').shape

In [None]:
# query and
df.query('`sepal_width` < 3 & `petal_length` > 4').shape
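
Boolean masks and `query` express the same filters. The sketch below checks this on a small made-up DataFrame (`small_df` and its values are assumptions for illustration, not the iris data).

```python
import pandas as pd

# Hypothetical data using two of the iris column names.
small_df = pd.DataFrame({
    'sepal_width': [3.5, 2.8, 3.0, 2.5, 2.9],
    'petal_length': [1.4, 4.5, 1.3, 5.0, 3.9],
})

# Each comparison yields a boolean Series; & and | combine them row by row.
# The parentheses are required because & and | bind tighter than < and >.
narrow = small_df['sepal_width'] < 3
long_petal = small_df['petal_length'] > 4

rows_and = small_df[narrow & long_petal]
rows_or = small_df[narrow | long_petal]

# The same "and" filter written as a query string.
rows_query = small_df.query('sepal_width < 3 and petal_length > 4')

print(rows_and.shape, rows_or.shape, rows_query.shape)
```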

# Matplotlib and Seaborn

In [None]:
import matplotlib.pyplot as plt

# Sample data
x = [1, 2, 3, 4, 5]
y = [2, 4, 1, 3, 5]

# Create a line plot
plt.plot(x, y)

# Add labels and title
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Simple Line Plot")

# Display the plot
plt.show()

**Explanation:**

* `import matplotlib.pyplot as plt`: Imports the Matplotlib library's `pyplot` module, which provides functions for creating plots.
* `x` and `y`: These lists hold the data points for the plot.
* `plt.plot(x, y)`: Creates a line plot using the data in `x` and `y`.
* `plt.xlabel()`, `plt.ylabel()`, `plt.title()`: Add labels to the x-axis, y-axis, and the plot title.
* `plt.show()`: Displays the generated plot.

**Additional Matplotlib Examples:**

* **Scatter plot:**

In [None]:
plt.scatter(x, y)
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Simple Scatter Plot")
plt.show()

* **Bar chart:**

In [None]:
categories = ['A', 'B', 'C', 'D']
values = [10, 15, 5, 20]
plt.bar(categories, values)
plt.xlabel("Categories")
plt.ylabel("Values")
plt.title("Simple Bar Chart")
plt.show()

* **Histogram:**

In [None]:
data = [1, 2, 2, 3, 3, 3, 4, 4, 5]
plt.hist(data)
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.title("Simple Histogram")
plt.show()

**Introduction to Seaborn**

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Load a sample dataset
data = sns.load_dataset('iris')

# Create a scatter plot with colors based on species
sns.scatterplot(x='sepal_length', y='sepal_width', hue='species', data=data)

# Add title
plt.title("Scatter Plot with Seaborn")

# Display the plot
plt.show()

**Explanation:**

* `import seaborn as sns`: Imports the Seaborn library.
* `data = sns.load_dataset('iris')`: Loads the 'iris' example dataset. `load_dataset` fetches it from Seaborn's online data repository, so it needs an internet connection.
* `sns.scatterplot(...)`: Creates a scatter plot using Seaborn.
    * `x` and `y`: Specify the columns for the x and y axes.
    * `hue`: Specifies a third column to color the points based on categories.
    * `data`: Specifies the dataset to use.
* `plt.title()`: Adds a title to the plot.
* `plt.show()`: Displays the generated plot.

**Additional Seaborn Examples:**

* **Histogram with density plot:**

In [None]:
sns.histplot(data['sepal_length'], kde=True)
plt.title("Histogram with Density Plot")
plt.show()

* **Box plot:**

In [None]:
sns.boxplot(x='species', y='sepal_length', data=data)
plt.title("Box Plot")
plt.show()

* **Violin plot:**

In [None]:
sns.violinplot(x='species', y='sepal_length', data=data)
plt.title("Violin Plot")
plt.show()

**Key Points for Teaching:**

* **Start with basics:** Begin with simple plots and gradually introduce more complex ones.
* **Explain the code:** Clearly explain each line of code and its purpose.
* **Hands-on practice:** Encourage students to code along and experiment with different parameters and datasets.
* **Highlight the differences:** Compare and contrast Matplotlib and Seaborn to show their strengths and weaknesses.
* **Use real-world examples:** Show how these libraries can be used to visualize real-world data.



# Sklearn

**1. Linear Regression**

In [None]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Load the diabetes dataset (or any dataset with numeric features and a target variable)
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
data = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
data['target'] = diabetes.target

# Split data into features (X) and target (y)
X = data.drop('target', axis=1)
y = data['target']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a linear regression model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

**Explanation:**

* **Import necessary libraries:** `pandas` for data manipulation, `sklearn.linear_model` for the Linear Regression model, `train_test_split` to split the data, and `mean_squared_error` and `r2_score` for model evaluation.
* **Load the dataset:** Use `load_diabetes()` to get a sample dataset, or replace it with your own data.
* **Prepare the data:** Separate features (X) and the target variable (y).
* **Split data:** Divide the data into training and testing sets using `train_test_split`.
* **Create and train the model:** Initialize a `LinearRegression` model and train it using the training data.
* **Make predictions:** Use the trained model to predict the target variable on the test set.
* **Evaluate the model:** Calculate Mean Squared Error and R-squared to assess the model's performance.
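
To make the two evaluation metrics concrete, here is a minimal sketch that computes Mean Squared Error and R-squared by hand on made-up numbers. The lists `y_true` and `y_hat` are illustrative assumptions, not output from the model above.

```python
# Hypothetical true values and predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.5, 5.5, 7.0, 8.0]

n = len(y_true)

# Mean Squared Error: the average of the squared prediction errors.
squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_hat)]
mse_by_hand = sum(squared_errors) / n

# R-squared: 1 minus (residual sum of squares / total sum of squares).
mean_true = sum(y_true) / n
ss_res = sum(squared_errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2_by_hand = 1 - ss_res / ss_tot

print(mse_by_hand)
print(r2_by_hand)
```

A lower MSE means predictions sit closer to the true values, and an R-squared near 1 means the model accounts for most of the variation in the target.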

**2. Logistic Regression**

In [None]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# Load the breast cancer dataset (or any dataset with numeric features and a binary target variable)
from sklearn.datasets import load_breast_cancer
breast_cancer = load_breast_cancer()
data = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
data['target'] = breast_cancer.target

# Split data into features (X) and target (y)
X = data.drop('target', axis=1)
y = data['target']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

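# Create a logistic regression model
# (max_iter is raised above the default of 100 because the unscaled features may otherwise trigger a ConvergenceWarning)
model = LogisticRegression(max_iter=10000)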

# Train the model
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print(f"Confusion Matrix:\n {conf_matrix}")

**Explanation:**

* **Import necessary libraries:** Similar to linear regression, but import `LogisticRegression` for the model and `accuracy_score` and `confusion_matrix` for evaluation.
* **Load the dataset:** Use `load_breast_cancer()` or your own data with a binary target variable.
* **Prepare and split the data:** Similar to linear regression.
* **Create and train the model:** Initialize a `LogisticRegression` model and train it on the training data.
* **Make predictions:** Predict the target variable (classes) on the test set.
* **Evaluate the model:** Calculate accuracy and generate a confusion matrix to assess the model's performance in classification.
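
Accuracy and the confusion matrix can be worked out by hand on a handful of made-up labels. The lists `labels_true` and `labels_pred` are assumptions for illustration. The matrix layout follows the scikit-learn convention of rows for true classes and columns for predicted classes.

```python
# Hypothetical true and predicted class labels (0 or 1).
labels_true = [1, 0, 1, 1, 0, 0]
labels_pred = [1, 0, 0, 1, 1, 0]

pairs = list(zip(labels_true, labels_pred))
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives

# Accuracy: the share of predictions that match the true label.
accuracy_by_hand = (tp + tn) / len(pairs)

# Rows are true classes 0 and 1; columns are predicted classes 0 and 1.
matrix_by_hand = [[tn, fp], [fn, tp]]

print(accuracy_by_hand)
print(matrix_by_hand)
```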

**Key Points for Teaching:**

* **Explain the concepts:** Briefly explain the theory behind linear and logistic regression.
* **Dataset selection:** Choose datasets that are easy to understand and relevant to your students.
* **Feature importance:** Discuss the importance of feature selection and engineering.
* **Model evaluation:** Explain different evaluation metrics and their significance.
* **Hands-on practice:** Encourage students to experiment with different datasets and model parameters.



# Statsmodels and SciPy

**Statsmodels**

Statsmodels is a powerful library for statistical modeling, providing a wide range of statistical tests, models, and diagnostic tools. Here are a couple of examples:

**1. Ordinary Least Squares (OLS) Regression**

In [None]:
import statsmodels.api as sm
import pandas as pd
from sklearn.datasets import load_diabetes

# Load the diabetes dataset
diabetes = load_diabetes()
data = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
data['target'] = diabetes.target  # Target variable is a quantitative measure of disease progression one year after baseline

# Define the dependent and independent variables
X = data[['bmi', 's5']]  # Body mass index and a blood serum measurement
y = data['target']

# Add a constant term to the independent variables
X = sm.add_constant(X)

# Create and fit the OLS model
model = sm.OLS(y, X)
results = model.fit()

# Print the regression results
print(results.summary())

This code demonstrates how to perform OLS regression using `statsmodels`. It loads the diabetes dataset, selects two predictor variables (`bmi` and `s5`), adds a constant term, and fits the model. The `results.summary()` method provides a comprehensive output with statistical details, including coefficients, R-squared, p-values, and more.
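
Why add a constant term? A rough sketch with NumPy alone, on made-up data that follows `y = 1 + 2x` exactly: without a column of ones the fitted line is forced through the origin, while with it the intercept is recovered. This uses `np.linalg.lstsq` as a stand-in for illustration and is not how `statsmodels` is called above.

```python
import numpy as np

# Hypothetical data generated from y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

# Without a constant column the model is y = slope * x.
X_no_const = x.reshape(-1, 1)
coef_no_const, *_ = np.linalg.lstsq(X_no_const, y, rcond=None)

# With a leading column of ones the model is y = intercept + slope * x.
X_with_const = np.column_stack([np.ones_like(x), x])
coef_with_const, *_ = np.linalg.lstsq(X_with_const, y, rcond=None)

print(coef_no_const)      # slope only, biased by the missing intercept
print(coef_with_const)    # intercept and slope
```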

**2. Analysis of Variance (ANOVA)**

In [None]:
import statsmodels.api as sm  # provides sm.stats.anova_lm used below
import statsmodels.formula.api as smf
import pandas as pd

# Create a sample dataset
data = {'group': ['A', 'A', 'B', 'B', 'C', 'C'],
        'value': [10, 12, 15, 18, 20, 22]}
df = pd.DataFrame(data)

# Fit the ANOVA model
model = smf.ols('value ~ group', data=df)
results = model.fit()

# Build the ANOVA table (Type II sums of squares)
anova_table = sm.stats.anova_lm(results, typ=2)

# Print the ANOVA table
print(anova_table)

This example shows how to conduct ANOVA using `statsmodels`. It creates a sample dataset with groups and their corresponding values. The `ols` function from `statsmodels.formula.api` is used to specify the model using R-style formulas. The `anova_lm` function then performs the ANOVA analysis and generates a table with F-statistics, p-values, and other relevant information.
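
To connect the F-statistic to the underlying arithmetic, the sketch below computes a one-way ANOVA F value by hand for the same six sample values, using plain Python. The variable names are illustrative assumptions.

```python
# Same sample values as the dataset above, grouped by label.
groups = {
    'A': [10, 12],
    'B': [15, 18],
    'C': [20, 22],
}

all_values = [v for values in groups.values() for v in values]
grand_mean = sum(all_values) / len(all_values)

# Between-group sum of squares: how far each group mean is from the grand mean.
ss_between = sum(
    len(values) * (sum(values) / len(values) - grand_mean) ** 2
    for values in groups.values()
)

# Within-group sum of squares: spread of the values around their own group mean.
ss_within = sum(
    (v - sum(values) / len(values)) ** 2
    for values in groups.values()
    for v in values
)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

# F is the ratio of the two mean squares.
f_stat = (ss_between / df_between) / (ss_within / df_within)

print(ss_between, ss_within)
print(f_stat)
```

A large F value indicates that the group means differ by more than the within-group spread would suggest.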

**SciPy**

SciPy is a library for scientific computing that builds on NumPy and provides a wide range of algorithms and functions for scientific tasks. Here's an example:

**1. T-test**

In [None]:
from scipy import stats
import numpy as np

# Generate two sample datasets
group1 = np.random.normal(loc=10, scale=2, size=20)
group2 = np.random.normal(loc=12, scale=2, size=20)

# Perform independent samples t-test
t_stat, p_value = stats.ttest_ind(group1, group2)

print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

This code demonstrates how to perform an independent samples t-test using `scipy.stats`. It generates two sample datasets and uses the `ttest_ind` function to calculate the t-statistic and p-value. This test is used to determine if there is a significant difference between the means of two independent groups.
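
Because the samples are random, the numbers change on every run. The sketch below is one possible reproducible variant, assuming a seeded generator from `np.random.default_rng` (the seed 42 is arbitrary). It also recomputes the t-statistic from the pooled-variance formula that `ttest_ind` uses by default (equal variances assumed).

```python
import numpy as np
from scipy import stats

# A seeded generator makes the random samples repeatable.
rng = np.random.default_rng(42)
group1 = rng.normal(loc=10, scale=2, size=20)
group2 = rng.normal(loc=12, scale=2, size=20)

t_stat, p_value = stats.ttest_ind(group1, group2)

# Manual check: pooled variance, then the standardized difference in means.
n1, n2 = len(group1), len(group2)
pooled_var = (
    (n1 - 1) * group1.var(ddof=1) + (n2 - 1) * group2.var(ddof=1)
) / (n1 + n2 - 2)
t_manual = (group1.mean() - group2.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))

print(f"T-statistic: {t_stat}")
print(f"Manual t:    {t_manual}")
print(f"P-value: {p_value}")
```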

**Key Points for Teaching:**

* **Explain the purpose:** Clearly explain the purpose and applications of each library.
* **Focus on practical examples:** Use simple, relatable examples to demonstrate the functionality.
* **Connect to statistical concepts:** Emphasize the connection between the code and the underlying statistical concepts.
* **Encourage exploration:** Encourage students to explore the documentation and experiment with different functions and datasets.

