
# Preface

## Why Python?
1. General purpose language - add what you need
2. Portable (Linux, Windows, Mac)
3. Interactive
4. Free
5. Community and eco-system
6. Easy to use

## Working with Python
There are many possible workflows, so find your own! In this course we use Jupyter notebooks and Pandas:
* Python + Jupyter notebook + Pandas = A complete environment
* Interactive
* Encourages an iterative work process (much like research)
* Documentation, code and visualization in one - literate programming
* Reproducing results and figures

# Introduction

In this session, you will learn the basic Python syntax for data manipulation & analysis, including:

1. General syntax
2. Basic operations
3. Object & data types
4. Flow controls




In [None]:
import numpy as np # Basic library for all kind of numerical operations
import pandas as pd # Basic library for data manipulation in dataframes

# Basics

## The very basics

In [None]:
# Running a cell (Ctrl-Enter, Shift-Enter)
print('Hello world')

Hello world


## Variables

In [None]:
i = 6
print(i, type(i))

6 <class 'int'>


In [None]:
x = 3.2
print(x, type(x))

3.2 <class 'float'>


In [None]:
s = 'Hello'
print(s, type(s))

Hello <class 'str'>


## Value assignment & evaluation

In [None]:
x = 3         # Assignment
print('We assigned x the value of ', x)              # Evaluate the expression and print result

We assigned x the value of  3


In [None]:
y = 4         # Assignment
y + 5         # Evaluation, y remains 4

9

In [None]:
z = x + 17*y  # Assignment
z             # Evaluation

71

In [None]:
# Basic mathematical operations
print(x+y, x*y, x-y, x/y, x**2, x+y**2, (x+y)**2)

7 12 -1 0.75 9 19 49


## Value comparison

Comparisons return boolean values: True or False

In [None]:
2==2  # Equality

True

In [None]:
2!=2  # Inequality

False

In [None]:
x <= y # less than or equal: "<", ">", and ">=" also work

True

In [None]:
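(x | z) >= y  # "|" is the bitwise OR of the integers x and z, not a logical "or"

True

In [None]:
(x & z) >= y  # "&" is the bitwise AND of the integers x and z, not a logical "and"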

False

In [None]:
x + z / 50 < y

False
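
The `|` and `&` used in the cells above are bitwise operators on integers, while `or` and `and` combine truth values. A minimal sketch of the difference, reusing the values x = 3, y = 4 and z = 71 from above (the result names such as `bitwise_or` are illustrative only):

```python
x, y, z = 3, 4, 71

# Bitwise operators combine the binary representations of the integers
bitwise_or = x | z     # 0b0000011 | 0b1000111 = 0b1000111
bitwise_and = x & z    # 0b0000011 & 0b1000111 = 0b0000011
print(bitwise_or, bitwise_and)    # 71 3

# Logical operators combine the results of comparisons
both = (x >= y) and (z >= y)      # False and True
either = (x >= y) or (z >= y)     # False or True
print(both, either)               # False True
```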

## Special Constants: None, NaN, Inf

In [None]:
print([1, None, 3])

[1, None, 3]
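
The cell above only shows `None`. As a small sketch of how the other special values behave, the example below uses the standard-library `math` module; NumPy exposes the same float values as `np.nan` and `np.inf`. The variable names are illustrative only.

```python
import math

missing = None             # Python's "no value" object
not_a_number = math.nan    # floating-point "not a number"
infinity = math.inf        # floating-point positive infinity

print(missing is None)                 # True: test for None with "is"
print(not_a_number == not_a_number)    # False: NaN is not equal to anything, including itself
print(math.isnan(not_a_number))        # True: use isnan to test for NaN
print(infinity > 10**308)              # True
print(-infinity < 0)                   # True
```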


## Importing
We constantly need to import libraries, or only parts of them. Follow the established naming conventions when doing so (for example, `import numpy as np`).

In [None]:
from math import sqrt

In [None]:
a = 2
b = 3

c = sqrt(a**2 + b**2)
print(c)

3.605551275463989
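
As a rough sketch of the three common import styles, the example below uses only standard-library modules. The alias `stats` is an arbitrary choice for illustration, not an established convention.

```python
import math                   # import the whole module, access names via math.<name>
from math import sqrt, pi     # import only selected names
import statistics as stats    # import a module under a shorter alias

root_a = math.sqrt(16)
root_b = sqrt(16)
print(root_a, root_b)         # 4.0 4.0
print(round(pi, 2))           # 3.14
print(stats.mean([1.0, 2.0, 3.0]))
```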


## Functions
* Define a function
* Function name: pythagoras
* Arguments: a, b
* Indentation using tab (4 spaces) for the whole function
* `return` statement

In [None]:
def pythagoras(a, b):
    return sqrt(a**2 + b**2) # Notice the tab!

In [None]:
print(pythagoras)

<function pythagoras at 0x7fa0dd8877a0>


In [None]:
c = pythagoras(a, b)
print(c)

3.605551275463989


In [None]:
some_list = [(2,4),(6,7),(8,9),(1,6)]
pd.DataFrame(some_list)

   0  1
0  2  4
1  6  7
2  8  9
3  1  6


In [None]:
[pythagoras(stuff[0],stuff[1]) for stuff in some_list]

[4.47213595499958, 9.219544457292887, 12.041594578792296, 6.082762530298219]

**Best practice:** Add documentation via
* A docstring (`"""`)
* Try placing the cursor at the function and press `<shift+tab>`

In [None]:
def pythagoras(a, b):
    """
    Computes the length of the hypotenuse of a right triangle

    Arguments
    a, b: the two lengths of the right triangle
    """

    return sqrt(a**2 + b**2)
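
Outside the notebook shortcut, a docstring can also be read from code. A minimal sketch, with the function redefined using a shortened docstring so the example is self-contained:

```python
from math import sqrt

def pythagoras(a, b):
    """Computes the length of the hypotenuse of a right triangle."""
    return sqrt(a**2 + b**2)

print(pythagoras.__doc__)    # the raw docstring text
help(pythagoras)             # a formatted help page built from the docstring
```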

## Mini-assignment
* Construct a function that given two points $(x_1, y_1), (x_2, y_2)$ on a line computes the slope $a$ of the line
$$ y = ax + b$$
given by
$$ a = \frac{y_2- y_1}{x_2 - x_1}$$

# Flow Control (loops & friends)

Python is designed for readability, so indentation and line breaks are part of the syntax.
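
A small sketch of what that means in practice: the indentation alone decides which lines belong to a loop. The list `messages` is a hypothetical name used only for this illustration.

```python
messages = []
for i in range(3):
    messages.append(f"inside the loop, i = {i}")   # indented: runs on every iteration
messages.append("after the loop")                  # not indented: runs once, after the loop

for line in messages:
    print(line)
```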


In [None]:
# If/else controls
x = 5
y = 10

if (x==0):
  y = 0
else:
  y = y/x
  print(y)

2.0


In [None]:
# For loops
for i in range(1,x+1):
  print("OMG, i just counted to " + str(i))

OMG, i just counted to 1
OMG, i just counted to 2
OMG, i just counted to 3
OMG, i just counted to 4
OMG, i just counted to 5


In [None]:
# While loop
x = 5

while x > 0:
  print(x)
  x = x-1

5
4
3
2
1


In [None]:
x = 1

while True:
  print(x)
  x = x + 1
  if x > 7:
    break

1
2
3
4
5
6
7


In [None]:
even = [] # empty list
for i in range(10):
    even.append(i*2)
even

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

In [None]:
odd = []
for i in even:
    odd.append(i+1)
odd

[1, 3, 5, 7, 9, 11, 13, 15, 17, 19]

### Mini-assignment

Write a function `KtoC` that translates Kelvin to Celsius

$$ C = K - 273.15 \quad \text{with} \quad C\geq - 273.15$$

The function returns `None` when $C < -273.15$

## Error Handling

In Python, errors and exceptions can be managed using try-except blocks.

### Basic Error Handling

For example, let's see what happens when you try to divide by zero and how to handle it.

In [None]:
try:
    result = 10 / 0
except ZeroDivisionError:
    print("You can't divide by zero!")

### Multiple Errors

You can also handle multiple exceptions in a single try-except block.

In [None]:
x = "a"

try:
    result = 10 / int(x)
except ZeroDivisionError:
    print("You can't divide by zero!")
except ValueError:
    print("Invalid input; please enter a number.")




### Else and Finally Clauses

You can also use 'else' and 'finally' clauses with try-except blocks.
The code in 'else' will run if the try block doesn't raise an exception,
and 'finally' will run regardless of whether an exception is raised or not.




In [None]:
try:
    result = 10 / 2
except ZeroDivisionError:
    print("You can't divide by zero!")
else:
    print("Division successful!")
finally:
    print("This block of code will always run.")

### Custom Exceptions

You can also raise exceptions yourself using the `raise` keyword, here a built-in `ValueError` with a custom message.

In [None]:
def pythagoras_with_error_check(a, b):
    """
    Computes the length of the hypotenuse of a right triangle.
    Raises a ValueError if either a or b is negative.
    """
    if a < 0 or b < 0:
        raise ValueError("Sides of a right triangle cannot be negative.")
    return sqrt(a ** 2 + b ** 2)

In [None]:
# Let's see how the raised exception is handled.
try:
    print(pythagoras_with_error_check(-3, 4))
except ValueError as e:
    print(e)
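
The function above raises the built-in `ValueError`. If you want an exception type of your own, one possible sketch is to subclass an existing exception. The names `NegativeSideError` and `pythagoras_strict` are hypothetical and exist only for this illustration.

```python
from math import sqrt

class NegativeSideError(ValueError):
    """Hypothetical exception type, defined here only for illustration."""

def pythagoras_strict(a, b):
    """Like pythagoras, but rejects negative side lengths."""
    if a < 0 or b < 0:
        raise NegativeSideError("Sides of a right triangle cannot be negative.")
    return sqrt(a**2 + b**2)

try:
    pythagoras_strict(-3, 4)
except NegativeSideError as e:
    message = str(e)
    print(type(e).__name__, "-", message)
```

Because `NegativeSideError` inherits from `ValueError`, an existing `except ValueError` block would still catch it.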




# Object classes


## Vector

One-dimensional collection of values

In [None]:
# Numeric
v1 = [1,5,11,33] # [] creates a list
v1

[1, 5, 11, 33]

In [None]:
# String
v2 = ["hello","world"]
v2

['hello', 'world']

In [None]:
# Boolean
v3 = [True, True, False, True]
v3

[True, True, False, True]

Accessing elements in vectors

In [None]:
v1[0]

1

In [None]:
v1[1:3]

[5, 11]
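
Indexing starts at 0 and a slice `start:stop` excludes the element at `stop`. A few more access patterns, sketched on a separate list `v` so that `v1` stays untouched:

```python
v = [1, 5, 11, 33]

print(v[-1])     # 33: negative indices count from the end
print(v[:2])     # [1, 5]: the first two elements
print(v[2:])     # [11, 33]: from index 2 to the end
print(v[::2])    # [1, 11]: every second element
print(len(v))    # 4: the number of elements
```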

Manipulating vector elements

In [None]:
v1[2] = 1337
v1

[1, 5, 1337, 33]

Combining vectors of different types gives a list of lists (more on lists later), in which all elements keep their original type

In [None]:
v5 =[v1, v2, v3]
v5
# Integers (numbers) are still numbers, not strings (text). Easy to see because they don't have ' '

[[1, 5, 1337, 33], ['hello', 'world'], [True, True, False, True]]

Adding vectors concatenates them (it does not sum them element-wise)

In [None]:
v1 + v3

[1, 5, 1337, 33, True, True, False, True]

In [None]:
# Same for multiplication
v1 * 2

[1, 5, 1337, 33, 1, 5, 1337, 33]

**Element-wise operations:** To do numerical operations on vectors, use NumPy arrays (`numpy.array`). NumPy is a library that adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays. Here you can already see that Python is a computer science (CS) language at heart: numerical work goes through a library.

In [None]:
v1_array = np.array(v1)
v2_array = np.array(v2)
v3_array = np.array(v3)

In [None]:
v1_array

array([   1,    5, 1337,   33])

In [None]:
v1_array + 5

array([   6,   10, 1342,   38])

In [None]:
v1_array + v3_array

array([   2,    6, 1337,   34])

In [None]:
# Arrays of different size
v1_array + np.array([1,7])

ValueError: ignored

In [None]:
# non-numerical arrays
v1_array + v2_array

UFuncTypeError: ignored
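
The two errors above come from incompatible shapes and incompatible data types. A minimal sketch of which combinations work, using a fresh array `a` with the same values as `v1_array` (the other names are illustrative):

```python
import numpy as np

a = np.array([1, 5, 1337, 33])

same_shape = a + np.array([10, 20, 30, 40])   # shapes (4,) and (4,): added element by element
with_scalar = a * 2                           # a single number is applied to every element
print(same_shape)
print(with_scalar)

try:
    a + np.array([1, 7])                      # shapes (4,) and (2,) do not match
except ValueError as e:
    print("ValueError:", e)
```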

**Mathematical operations over the vector:** For most maths you need to use NumPy or other modules (Python is not per se a maths language)

In [None]:
# NumPy functions also work directly on a plain Python list
np.sum(v1)

1376

In [None]:
np.mean(v1)

344.0

In [None]:
# Population standard deviation: ddof (delta degrees of freedom) is 0 by default
np.std(v1, ddof=0)

573.4413657907842

In [None]:
# Sample standard deviation: ddof=1 divides by n - 1 instead of n
np.std(v1, ddof=1)

662.1530538075518
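
To see what `ddof` changes, the two results can be recomputed by hand: the sum of squared deviations from the mean is divided by `n - ddof` before taking the square root. A sketch using only the standard library and the current values of `v1` (the variable names are illustrative):

```python
import math

values = [1, 5, 1337, 33]
n = len(values)
mean = sum(values) / n                                    # 344.0
squared_deviations = [(v - mean) ** 2 for v in values]

population_std = math.sqrt(sum(squared_deviations) / n)         # divides by n     (ddof=0)
sample_std = math.sqrt(sum(squared_deviations) / (n - 1))       # divides by n - 1 (ddof=1)
print(population_std, sample_std)
```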

In [None]:
np.corrcoef(v1,v1)

array([[1., 1.],
       [1., 1.]])

Also consider this cheat sheet

https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf

## Lists
* An indexable collection of variables (objects)
* C-style or 0-indexed

In [None]:
l = ['Caroline', 1.0, pythagoras]
type(l)

list

In [None]:
l

['Caroline', 1.0, <function __main__.pythagoras(a, b)>]

In [None]:
l[0]

'Caroline'

In [None]:
type(l[0])

str

Common methods for lists

In [None]:
l.append(sqrt(2.0))
l

['Caroline', 1.0, <function __main__.pythagoras(a, b)>, 1.4142135623730951]

In [None]:
a = l.pop(2)
a

<function __main__.pythagoras(a, b)>

In [None]:
l

['Caroline', 1.0, 1.4142135623730951]

In [None]:
l.pop(0)
l.append(100)
l.sort(reverse=True)

In [None]:
l

[100, 1.4142135623730951, 1.0]

In [None]:
l[1] = 2
l

[100, 2, 1.0]

In [None]:
l.extend([6.0, 4])
l

[100, 2, 1.0, 6.0, 4]

## Introduction to List Comprehensions

List comprehensions provide a concise way to create lists. They can replace `for` loops for certain tasks, making your code more readable and usually faster. Below are a few examples to help you understand the basics:
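
As a rough sketch of that equivalence, the same list is built once with a `for` loop and once with a comprehension (the names `squares_loop` and `squares_comp` are illustrative):

```python
# Loop version: build the list step by step
squares_loop = []
for x in range(5):
    squares_loop.append(x * x)

# Comprehension version: the same result in one expression
squares_comp = [x * x for x in range(5)]

print(squares_loop)    # [0, 1, 4, 9, 16]
print(squares_comp)    # [0, 1, 4, 9, 16]
```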

### Basic Syntax

The basic syntax of list comprehension looks like this:

```python
new_list = [expression for item in iterable]
```


In [None]:
# Create a list of squares of numbers from 0 to 9
squares = [x*x for x in range(10)]
print(squares)

In [None]:
# Create a list of even numbers from a given list
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9]
even_numbers = [x for x in numbers if x % 2 == 0]
print(even_numbers)

In [None]:
# Finding Multiples of 2:
multiples_of_2 = [x for x in numbers if x % 2 == 0]
print(f"Multiples of 2: {multiples_of_2}")

In [None]:
# Finding Multiples of 2 or 3:
multiples_of_2_or_3 = [x for x in numbers if x % 2 == 0 or x % 3 == 0]
print(f"Multiples of 2 or 3: {multiples_of_2_or_3}")

In [None]:
# Create a list of strings with length greater than 2 from a given list of strings
words = ['apple', 'bat', 'cat', 'dog']
long_words = [word for word in words if len(word) > 2]
print(long_words)

### Task

Your task is to create a list comprehension that iterates through a list of numbers and stores only the even numbers in a new list. Also, if the number is a multiple of 3 and even, store its square instead of the number itself.

### Instructions

1. Start with the list `numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]`.
2. Create a list comprehension that follows the rules mentioned above.
3. Store the result in a variable called `processed_list`.
4. Print the `processed_list`.

In [None]:
# Your code here
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

### Expected Output

Your `processed_list` should look like this:

```python
[2, 4, 36, 8, 10]
```


## Tuples
* Immutable "lists"

In [None]:
t = (1.0, 4.0)
t, type(t)

((1.0, 4.0), tuple)

In [None]:
t[1]

4.0

In [None]:
t[1] = 2

TypeError: ignored
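
Since a tuple cannot be changed in place, the usual pattern is to unpack it or to build a new one. A minimal sketch (the names `point`, `x_coord`, `y_coord` and `moved` are illustrative):

```python
point = (1.0, 4.0)

x_coord, y_coord = point        # unpacking: one name per element
print(x_coord, y_coord)         # 1.0 4.0

try:
    point[1] = 2                # tuples do not support item assignment
except TypeError as e:
    print("TypeError:", e)

moved = (point[0], 2.0)         # build a new tuple instead of changing the old one
print(moved)                    # (1.0, 2.0)
```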

## Mini-assignment: Error Handling Practice with Loop and List

Your task is to write a function called `safe_division_list` that takes a list of tuples as an argument.
Each tuple contains two elements, `a` and `b`, for which you will perform division of `a` by `b`.
The function should handle errors appropriately and append a proper message to the result list for each of the following scenarios:

1. If `b` is zero, append "Cannot divide by zero."
2. If either `a` or `b` is not a number, append "Inputs must be numbers."
3. If the division is successful, append the result to the list.

Finally, the function should return the list of results.

In [None]:
# Your code here

# Complete the function skeleton below

def safe_division_list(tuple_list):
    results = []
    for a, b in tuple_list:
        # Your error handling code here
        pass
    return results

In [None]:
# Testing Your Code

test_list = [(10, 2), (10, 0), (10, 'a'), ('a', 2), (9, 3)]
print(safe_division_list(test_list))
# Should print [5.0, 'Cannot divide by zero.', 'Inputs must be numbers.', 'Inputs must be numbers.', 3.0]

## Dictionaries
- Like lists with user-definable indices
- Can, like lists and tuples, contain a mix of different types of data.
- The indices can *also* be different kinds of data - unlike lists and tuples.

In [None]:
d = {'one': 1, 2: 1 + 1, 3.0: 'three'}
d

{'one': 1, 2: 2, 3.0: 'three'}

Useful methods

In [None]:
d.keys()

dict_keys(['one', 2, 3.0])

In [None]:
d.items()

dict_items([('one', 1), (2, 2), (3.0, 'three')])

In [None]:
some_value = d.pop(3.0)
d

{'one': 1, 2: 2}

In [None]:
some_value

'three'

In [None]:
d['four'] = 4
d

{'one': 1, 2: 2, 'four': 4}

In [None]:
d.update({'five': 5.0, 6: 6.0})
d

{'one': 1, 2: 2, 'four': 4, 'five': 5.0, 6: 6.0}
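
Looking up values and looping over a dictionary are just as common as the methods above. A small sketch on a fresh dictionary, so that `d` from the cells above is not modified (the name `prices` and its contents are made up for illustration):

```python
prices = {'apple': 3, 'pear': 5}

print(prices['apple'])            # 3: look up a value by its key
print(prices.get('melon'))        # None: .get returns None instead of raising a KeyError
print(prices.get('melon', 0))     # 0: an optional default value
print('pear' in prices)           # True: membership is tested on the keys

pairs = []
for key, value in prices.items():     # iterate over key-value pairs
    pairs.append((key, value))
print(pairs)                      # [('apple', 3), ('pear', 5)]
```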

## Data Frames

In Python, data frames are handled by Pandas, a very comprehensive library for data manipulation and analysis.

We will introduce it in more detail later, so here is only a brief overview:

In [None]:
# We construct the DF from a dictionary which is indicated by {'some_key':['some_values']}

df1 = pd.DataFrame(
    {'ID':range(1,5), # Python counts from 0 and the last value in a range is excluded
     'FirstName':["Jesper","Jonas","Pernille","Helle"],
     'Female':[False,False,True,True],
     'Age':[22,33,44,55]
})

In [None]:
# Python makes little use of factors (categorical types); as you can see, pandas inferred the column types from your input
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   ID         4 non-null      int64 
 1   FirstName  4 non-null      object
 2   Female     4 non-null      bool  
 3   Age        4 non-null      int64 
dtypes: bool(1), int64(2), object(1)
memory usage: 228.0+ bytes


In [None]:
df1.FirstName #dot notation

0      Jesper
1       Jonas
2    Pernille
3       Helle
Name: FirstName, dtype: object

In [None]:
df1['FirstName'] #more traditional subsetting

0      Jesper
1       Jonas
2    Pernille
3       Helle
Name: FirstName, dtype: object

In [None]:
df1.loc[:,'FirstName'] #more complex subsetting

0      Jesper
1       Jonas
2    Pernille
3       Helle
Name: FirstName, dtype: object

In [None]:
df1.iloc[:,1] #index based

0      Jesper
1       Jonas
2    Pernille
3       Helle
Name: FirstName, dtype: object

In [None]:
# Rows 1 and 2, columns 3 and 4 - the gender and age of Jesper & Jonas
df1.iloc[[0,1],[2,3]]


   Female  Age
0   False   22
1   False   33


In [None]:
#Same thing
df1.loc[[0,1],['Female','Age']]

   Female  Age
0   False   22
1   False   33


In [None]:
# Rows 1 and 3, all columns

df1.iloc[[0,2],:] # remember to subtract 1 from each index when coming from R

   ID FirstName  Female  Age
0   1    Jesper   False   22
2   3  Pernille    True   44


In [None]:
# Find everyone over the age of 30 in the data
df1[df1.Age > 30]

   ID FirstName  Female  Age
1   2     Jonas   False   33
2   3  Pernille    True   44
3   4     Helle    True   55


In [None]:
# or "Query style" (There are always many ways of doing the same thing)
df1.query('Age > 30')

   ID FirstName  Female  Age
1   2     Jonas   False   33
2   3  Pernille    True   44
3   4     Helle    True   55
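
Filters can also be combined, or restricted to a single column. A minimal sketch on a rebuilt copy of the example data (the names `df`, `over_30_names` and `women_over_30` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'FirstName': ["Jesper", "Jonas", "Pernille", "Helle"],
    'Female': [False, False, True, True],
    'Age': [22, 33, 44, 55],
})

# Select only the FirstName column for the rows that match the condition
over_30_names = df.loc[df['Age'] > 30, 'FirstName'].tolist()
print(over_30_names)    # ['Jonas', 'Pernille', 'Helle']

# Combine conditions element-wise: & means "and", | means "or";
# each condition needs its own parentheses
women_over_30 = df[(df['Age'] > 30) & (df['Female'])]
print(women_over_30['FirstName'].tolist())    # ['Pernille', 'Helle']
```

Here `&` works element-wise on the boolean columns, which is why the plain `and` keyword is not used for data frame filters.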


## Pandas Exercise: Analyzing Sales Data

### Objective
In this exercise, you will analyze a sales dataset and answer some questions. The dataset contains information about products, their sales, and the profits made.

### Step 1: Importing Data
First, import the Pandas library. We have already created a DataFrame named `sales_df` containing the sales data for you.

### Step 2: Basic Analysis
Perform some basic analyses on the data.

1. Display the first 5 rows of the DataFrame.
2. Get summary statistics using the `.describe()` method.

### Step 3: Answer the Following Questions
Using Pandas functionalities, please answer the following questions:

1. What is the total sales amount?
2. What is the average profit?
3. Which product has the highest sales?
4. Which product has the lowest profit?

In [None]:
# Create a sample DataFrame
data = {
    'Product': ['A', 'B', 'A', 'C', 'A', 'B', 'C'],
    'Sales': [1000, 1500, 900, 1200, 850, 1300, 1100],
    'Profit': [200, 300, 180, 250, 170, 220, 210]
}
sales_df = pd.DataFrame(data)


## Further Studies and Recommendations

If you're interested in diving deeper into Python, Pandas, and data analysis, here are some recommended resources that can help you on your learning journey.

### Recommended Readings
1. "Python for Data Analysis" by Wes McKinney - An excellent book to get you well-versed in using Pandas for data analysis.
2. "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" by Aurélien Géron - A great read for anyone looking to venture into machine learning after mastering Pandas.

### Recommended DataCamp Courses
1. [Data Manipulation with Pandas](https://www.datacamp.com/courses/data-manipulation-with-pandas) - A comprehensive course covering data manipulation using Pandas.
2. [Pandas Joins for Spreadsheet Users](https://www.datacamp.com/courses/pandas-joins-for-spreadsheet-users) - A specialized course focused on joining methods in Pandas, useful for those transitioning from Excel.
3. [Introduction to Data Science in Python](https://www.datacamp.com/courses/introduction-to-data-science-in-python) - A beginner-friendly course that touches on various aspects of data science, including data manipulation with Pandas.

### Online Blogs and Tutorials
1. [Pandas Official Documentation](https://pandas.pydata.org/docs/) - It’s always good to understand the official documentation.
2. [Towards Data Science](https://towardsdatascience.com/) - A Medium publication with lots of tutorials and articles on data science, including Pandas.
3. [Stack Overflow](https://stackoverflow.com/questions/tagged/pandas) - An invaluable resource for solving specific issues you might encounter.

### YouTube Channels
1. [Data School](https://www.youtube.com/user/dataschool) - Offers a variety of tutorials, including several focused on Pandas.
2. [Corey Schafer](https://www.youtube.com/user/schafer5) - Includes tutorials on Python and various libraries including Pandas.