# Getting Started with Python

This notebook will introduce you to the fundamentals of programming in Python.


## 0. Jupyter Notebooks

### Welcome to your first Jupyter Notebook! 📓

**Jupyter Notebooks** (`.ipynb` files) are an interactive way to write and run code. They combine:
- 📝 **Markdown text cells** (like this one)
- 💻 **Code cells** that can be executed individually

> 💡 This is incredibly useful for learning and experimenting with code, as you can run small chunks of code and see the results immediately.

### How to use this notebook:

To run a single (selected) cell, press **`Shift + Enter`**:
- For **code cells** → this will run the code (any output is printed immediately below the cell).
- For **markdown cells** → this will render the markdown (the formatted text is displayed inside the cell).

At the top of the notebook, you will see a toolbar with various options. You can use these to:
- Add new cells (of either type).
- Restart the kernel (Python will forget all variables).
- Clear all outputs (remove all printed output from code cells).

> 💡 Restarting the kernel + clearing outputs is useful if you want to 'start fresh'. It will NOT delete any code you have written.

---

*Read on, and you will see we will start switching between text and code cells...*

## 1. A little bit about your new friend (Python) 🐍

⭐ **Python emphasizes 'readability' and follows the philosophy that code should be written to be read by humans first, computers second.**

This is why Python often looks almost like plain English. **Make liberal use of in-line comments (using `#`) to explain your code.** This will help you (and others) understand it later!

In [1]:
#let's define a string variable (i.e., text):
my_string = "Hello, World!"

#now let's make the string uppercase:
my_string_upper = my_string.upper()
#notice that we choose an informative name for our modified variable.
#this is good practice, to keep track of what your variables contain.

#print the modified string to the screen:
print(my_string_upper)
#notice that we pass the variable that we just created (and modified) to the print() function, which does exactly what it says: it prints stuff to the screen.

#... now take a look

#Of course, your code will get more complex than this, but the same principles apply. Choose informative names for your variables, and use comments to explain what your code does.

HELLO, WORLD!


⭐ **Python is an 'object-oriented' programming language.**
 This means that it is made up of 'objects' that can `do` things (functions, methods) and `contain` things (data):

In [None]:
#And believe it or not, in the previous cell we already made use of all three of these!

my_string = 'Hello, World!'  # This is a string 'object' that contains the data "Hello, World!"
my_string_upper = my_string.upper()  # .upper() is a method that 'does' something to the string object that it 'belongs' to (it makes it uppercase).
print(my_string_upper)  # print() is a function that 'does' something to the object that we pass to it (it prints it to the screen).

#Functions are 'standalone' - they can be used on any object that is passed to them.
#Methods, on the other hand, are 'attached' to a specific object type, and can only be used on that type of object.

#This makes python a very powerful and flexible language - you can create your own objects, with their own methods and attributes.
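
To make that last point concrete, here is a minimal sketch of a custom object. The class name `Greeting` and its `shout()` method are invented for this illustration; they are not part of Python itself.

```python
# A hypothetical class, made up for this example.
class Greeting:
    def __init__(self, text):
        self.text = text          # an attribute: data that the object 'contains'

    def shout(self):
        # a method: something that the object can 'do'
        return self.text.upper()

my_greeting = Greeting("Hello, World!")
print(my_greeting.text)           # access the attribute
print(my_greeting.shout())        # call the method that 'belongs' to the object
print(len(my_greeting.text))      # len() is a standalone function that we pass an object to
```

You do not need to write classes to follow this notebook; the sketch only shows where methods and attributes come from.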

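⭐ **Python has a pre-loaded 'standard library'** - a massive collection of pre-written code modules that handle common tasks, so you don't have to write everything from scratch. The most basic tools (like strings and `print()`) are 'built in' and always available; other standard-library modules (like `math`) ship with Python but still need an `import` before use.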


In [4]:
#Notice how we did not need to 'load in' anything to create a string variable, or use the print() function or the .upper() method in the previous cell.
print('Wow, so easy!')

#However, because many tasks are highly specialized, there are many additional libraries (collections of pre-written code) that you will need to load in to do specific tasks.
#...this is VERY MUCH the case for machine learning, where we will be using libraries such as numpy, pandas, matplotlib, and sklearn.

#In order to use these (third-party) libraries, you must:

#1. Install the library (only need to do this once!):
# `%pip` is a jupyter 'magic' command that installs a package from within a notebook. Write it with NO space after the `%`, and keep comments on their own line.
%pip install numpy

#2. Import the library (you will need to do this at the start of your python code, or notebook)
import numpy as np #numpy is a library for numerical computing in python.

#Importing the library makes it available in the current 'namespace' (i.e., the current code file or notebook) for use.
#Abbreviating the library name (as np) is just a convention to make it easier to type repeatedly (you will get used to these).

#Now that we have imported numpy, we can use it to perform basic numerical operations:
result = np.mean([1,2,3,4,5]) #take the mean of a list of numbers (1 to 5).
print(result)

#That's it! Install, import, and use.

#PROTIP: most third-party libraries have extensive documentation online. A quick google search will usually get you to the right place:
#e.g., for numpy -  https://www.google.com/search?q=numpy+docs


Wow, so easy!
3.0


⭐ **Python is 'dynamically typed'.** This means you don't have to declare what type of data a variable will hold (like numbers, text, etc.) - Python figures it out automatically when you run your code.

In [None]:
#for example, let's create three variables:
a_number = 1
another_number = 2.5
a_nice_thought = "more than the sum of its parts"

#now let's see what 'type' of object python decided to store these variables as:
print(type(a_number))          # <class 'int'> - integer (whole number)
print(type(another_number))    # <class 'float'> - floating point number (decimal number)
print(type(a_nice_thought))    # <class 'str'> - string (text)

#... see the outputs below? They will match the comments above.

#PROTIP, as convenient as this is, sometimes we want to be explicit about what type of data we want to store in a variable.
#We can do this by 'casting' the variable to a specific type:
an_integer = int(2.5)  #cast to integer (will truncate the decimal part)
a_float = float(2)     #cast to float (will add a decimal part)
a_string = str(123)    #cast to string (will convert the number to text)

#In practice, you will likely not bother with this until something goes wrong (e.g., you try to do math with a string, or concatenate a number to a string). That is OK!
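
The 'something goes wrong' case can be sketched as follows. The variable names are hypothetical, and the failing line is left commented out so that the cell still runs.

```python
# Hypothetical input: numbers often arrive as text (e.g., when read from a file).
count_as_text = "3"

# count_as_text + 2              # uncommenting this raises a TypeError (str + int)

total = int(count_as_text) + 2   # cast to int first, then do the math
print(total)                     # 5

label = "Total: " + str(total)   # cast to str before concatenating
print(label)                     # Total: 5

print(count_as_text * 2)         # 33 - multiplying a string repeats it!
```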

⭐ **Python uses 'indentation' (spaces or tabs) to organize code structure.** This makes Python code look clean and forces good formatting habits.

In [None]:
#for example, let's create a simple conditional statement:

secret_of_the_universe = 42 # the answer to life, the universe, and everything (stored as an integer)

#here we go:
if secret_of_the_universe == 42:
    print("You found the secret of the universe!")

#notice the indentation (4 spaces) before the print statement. This indicates that the print statement is part of the if statement, so it will only run if the 'if' condition is true.
# (which it is, in this case, because we set the secret_of_the_universe variable to 42).

#indentation is VERY important in python - and you will quickly warm up to it, as it makes code MUCH easier to read.
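
A small sketch of what indentation changes. It reuses the variable from the cell above, with a different value chosen purely for illustration; `found_secret` is a hypothetical helper variable.

```python
secret_of_the_universe = 7   # deliberately NOT 42 this time
found_secret = False

if secret_of_the_universe == 42:
    found_secret = True      # indented: part of the if-block, skipped here
    print("You found the secret of the universe!")
print("This line is not indented, so it runs every time.")

print(found_secret)          # False, because the indented block never ran
```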

## 2. Data Types
> 💡 Use `type()` to check what data type a variable is: `type(my_variable)`

#### **Basic Data Types**
> ⚠️ These are the fundamental building blocks of python.

**Integers (`int`)** - Whole numbers
```python
age = 25
year = 2025
negative_number = -10
```

**Floats (`float`)** - Numbers with decimal points
```python
height = 5.9
temperature = 98.6
pi = 3.14159
```
> 💡 Numbers are the most essential building blocks of machine learning, but you will not often need to manipulate them directly (in this course).

In [None]:
#Try different basic math operations on numbers (e.g., +, -, *, /, //, %, **)

#Try importing tmean from scipy.stats and using it to calculate the mean of a list of numbers.

#TIP: if you want to import one part of a larger library, you can use the 'from' keyword:
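from scipy.stats import tmean #this is saying: "from the scipy.stats module, import the tmean function"
#This is convenient - especially with larger libraries - because you can then write tmean(...) instead of scipy.stats.tmean(...).
#Note that python still loads the whole module behind the scenes, so this does NOT save memory or make your code run faster.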
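
The two import styles can be compared with a minimal sketch. It uses the standard-library `statistics` module (an assumption made here so that nothing needs to be installed) rather than `scipy.stats`.

```python
import statistics              # style 1: import the whole module
from statistics import mean    # style 2: import one name from the module

numbers = [1, 2, 3, 4, 5]

print(statistics.mean(numbers))   # style 1: access the function via the module name
print(mean(numbers))              # style 2: use the imported name directly

# Both names point to the very same function object:
print(mean is statistics.mean)    # True
```

The choice between the two is mostly about readability: the first makes it obvious where a function comes from, the second is shorter to type.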

**Strings (`str`)** - Text data (enclosed in quotes)
```python
name = "Alice"
message = 'Hello, World!'
multiline = """This is a
very long string
that spans multiple lines"""
```
> 💡 You can use either single or double quotes to define strings in Python (be consistent). We will spend a good deal of time working with strings in this course.


In [None]:
#Define a string variable and try out some string methods on it (e.g., .upper(), .lower(), .replace(), .find(), .split(), .join()).

#See what happens when you use math operations on strings (e.g., +, *, etc.) and how that differs from using them on numbers.

#Define a string - then print the first character, last character, and a slice of the string (e.g., characters 2 to 5).

**Booleans (`bool`)** - True or False values
```python
is_student = True
has_graduated = False
```
> 💡 Booleans are very useful to keep track of 'states' in your code, to then be used in conditional statements. `True` and `False` must be capitalized exactly like this (`true` or `TRUE` will not work).

In [None]:
# Define a string variable
text = "Python is great"

# Create a boolean variable that checks if the text length is greater than 10
# HINT: use the len() function to get the length of the string

# Use the boolean variable in a print statement.
# HINT: print() can take multiple arguments, separated by commas.

# Try changing the text to different lengths and see what happens!
# Try: "Hi", "Programming", "Machine learning is awesome"

#### **Collection Data Types**
> ⚠️ This is where things get more interesting. Collections can store multiple items, and you can nest collections inside other collections!

> ⚠️ This is where you will first encounter the concept of 'iterability' - an object is iterable if you can step through its elements one at a time (e.g., in a loop). This is true even if it holds only one element, or none at all.

> ⚠️ This is also where you will first encounter the concept of 'mutability' - whether or not you can change the contents of a collection after it is created.

**Lists (`list`)** - Ordered, mutable collections
```python
list_of_strings = ["apple", "banana", "orange"]
list_of_numbers = [1, 2, 3, 4, 5]
list_of_mixed = ["hello", 42, True, 3.14]  # Lists can contain different data types!
list_of_nested = [1, 2, [3, 4], 5]  # Lists can contain lists, or any other collection!
```
> 💡 You create a list using square brackets `[]`. You can then `.append()` a single new item to it, or `.extend()` it with the items of another iterable.

> 💡 Lists are great for storing data that you need to change (e.g., news articles that you will pre-process).

In [None]:
#Try creating a list, and then printing the first item, last item, and a slice of the list (e.g., items 2 to 5).

#Try creating another list. Then, what happens if you .append() it to the first list? What if you use .extend() instead?

**Tuples (`tuple`)** - Ordered, immutable collections
```python
tuple_of_strings = ("apple", "banana", "orange")
tuple_of_numbers = (1, 2, 3, 4, 5)
tuple_of_mixed = ("hello", 42, True, 3.14)
tuple_of_nested = (1, 2, (3, 4), 5)
```
> 💡 You create a tuple using parentheses `()`. A tuple with a single item needs a trailing comma, e.g. `(1,)`. You cannot change the values of a tuple once it's created.

> 💡 Tuples are great for storing data that should not change (e.g., keywords for a literature review search).
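
The difference between a mutable list and an immutable tuple can be sketched as follows. The variable names are hypothetical, and the failing line is commented out so that the cell still runs.

```python
fruit_list = ["apple", "banana", "orange"]     # a list: mutable
fruit_tuple = ("apple", "banana", "orange")    # a tuple: immutable

fruit_list[0] = "cherry"        # works: lists can be changed in place
print(fruit_list)

# fruit_tuple[0] = "cherry"     # uncommenting this raises a TypeError

# Both are iterable, so both can be stepped through (even with a single element):
for fruit in fruit_tuple:
    print(fruit)
for only_item in ["just one"]:
    print(only_item)

# To 'change' a tuple, you build a new one instead:
new_tuple = ("cherry",) + fruit_tuple[1:]
print(new_tuple)
```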

In [None]:
#What happens if you try to .append() or .extend() something to a tuple?

#What happens if you define a tuple with a list nested inside it? Is the nested list mutable?

**Dictionaries (`dict`)** - mutable collection of key-value pairs (like a phonebook).
```python
phonebook = {
    "Alice": "0555-1234",
    "Bob": "0555-5678",
    "Charlie": "0555-8765"
}
# Access values using keys: phonebook["Bob"] returns "0555-5678"
```
> 💡 You create dicts using the format '{key:value}'. To add a new k:v pair, you can use the syntax `dict[key] = value`; or use `.update()` to add multiple k:v pairs at once.

> 💡 Dicts are extremely useful collections that you will use often, since they can store messy, nested data, and make retrieving it (by key) very fast.
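
As a rough illustration of 'messy, nested data', here is a sketch with a made-up record. All keys and values are hypothetical.

```python
# A hypothetical nested record, invented for this example.
article = {
    "title": "Python basics",
    "tags": ["python", "tutorial"],                  # a list as a value
    "author": {"name": "Alice", "city": "Berlin"},   # a dict as a value
}

article["year"] = 2025                            # add a single key-value pair
article.update({"language": "en", "pages": 3})    # add several pairs at once

print(article["author"]["city"])    # chain keys to reach nested values
print(article["tags"][0])           # mix keys and list indexes as needed
print("year" in article)            # 'in' checks the keys of a dict
```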

In [None]:
#Create a dictionary, add some items to it, and then print an item by its key.

#Try adding multiple items to the dictionary at once using the .update() method.

#Try calling .keys(), .values(), and .items() on the dictionary - what do they return?


**Sets (`set`)** - Unordered, mutable collection of unique items.
```python
unique_numbers = {1, 2, 3, 4, 5}
colors = {"red", "green", "blue"}
# Sets automatically remove duplicates!
```
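> 💡 You create a set using curly braces with items inside, e.g. `{1, 2, 3}`. Careful: empty braces `{}` create an empty dictionary, so use `set()` for an empty set. Add items using the `.add()` method and remove items using the `.remove()` method.

> 💡 Sets can only contain 'hashable' objects, which in practice means immutable ones (e.g., not lists or dictionaries).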

> 💡 Sets are great for storing unique items and performing mathematical set operations (like unions and intersections).
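
The set operations mentioned above can be sketched like this. The example words are made up; the printed order of a set may differ on your machine, because sets are unordered.

```python
first_doc_words = {"python", "is", "great"}
second_doc_words = {"python", "is", "powerful"}

print(first_doc_words | second_doc_words)   # union: words in either set
print(first_doc_words & second_doc_words)   # intersection: words in both sets
print(first_doc_words - second_doc_words)   # difference: words only in the first set

empty_set = set()                           # the way to create an empty set
print(type(empty_set), type({}))            # {} on its own is an empty dict
```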

In [None]:
#Create a list of items, with one or more duplicates. Then cast it to a set using set(list). What happens to the duplicates?

#What happens if an iterable element nested within a set (e.g., a tuple) contains duplicates?
# e.g., as_an_example = {1, 2, 3, (1, 2, 3, 3), 4, 5, 6}

#### **Third-Party Data Types**
> ⚠️ Because of the flexible nature of python, there are many third-party libraries that define their own data types.
>
> ⚠️ Here we detail two that are of great relevance to machine learning and data science, that we will use in this course.

**NumPy Arrays (`numpy.ndarray`)** - mutable numerical arrays
```python
import numpy as np

# 1D array (like a list but much faster for math)
numbers = np.array([1, 2, 3, 4, 5])

# 2D array (like a matrix)
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

# nD array (3D and beyond...)
tensor = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

# Arrays are great for fast mathematical operations
result = numbers * 2  # Multiplies every element by 2
```
> 💡 NumPy arrays are more efficient than lists for numerical computations, and are the backbone of many machine learning libraries.

> 💡 You will not often need to work directly with arrays, as many libraries (like pandas and scikit-learn) manage them 'under the hood'.
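
To see why arrays are preferred for math, compare how a list and an array react to the same operator. This is a minimal sketch that assumes numpy is installed; the variable names are hypothetical.

```python
import numpy as np

prices = [1.0, 2.0, 3.0]
print(prices * 2)                # a list is repeated: [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]

price_array = np.array(prices)
print(price_array * 2)           # an array is multiplied element by element
print(price_array.shape)         # (3,) - one dimension with three elements
print(price_array.mean())        # arrays come with numerical methods built in
```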

In [None]:
#Create a list of numbers, and then convert it to a numpy array using np.array(list).

#Create two different lists. Then, create a 2D numpy array (matrix) using np.array([list1, list2]).

#What happens if you try to create a numpy array with lists of different lengths?

**Pandas DataFrames (`pandas.DataFrame`)** - Excel-like data tables
```python
import pandas as pd

# Create a DataFrame (like a spreadsheet)
data = {
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'city': ['Berlin', 'Munich', 'Hamburg']
}
df = pd.DataFrame(data)

# Access columns: df['name'] or df.name
# Access rows: df.iloc[0] (first row)
```
> 💡 DataFrames are incredibly powerful for data manipulation and analysis, with many built-in functions for filtering, grouping, and aggregating data.

> 💡 They should in many ways be the most 'familiar' data structure for anyone who has used Excel spreadsheets or CSV files, or tibbles in R.

> 💡 However, their usefulness is limited to data that can be arranged in a tabular format (i.e., rows and columns), which is often not the case for messy text data.

In [None]:
#create three lists of equal length (these will be our 'columns')

#Then, create a pandas DataFrame using pd.DataFrame({'col1': list1, 'col2': list2, 'col3': list3}).

#Describe the DataFrame using df.describe().

#Take a look at the first three rows of the DataFrame using df.head(3).

#Isolate a particular row or column of the DataFrame using df.iloc[] and df[].

#Check the 'type' of a single column of the DataFrame - is it still a list? or something else?

## 3. Control Structures
> 💡 Control structures allow you to design decision points and control the flow of your code.


### **Conditional Statements**
> ⚠️ Use these when you want your code to do different things based on different conditions.

**`if/elif/else` statements** - execute code based on boolean conditions
```python
temperature = 25

if temperature > 30:
    print("It's hot!")
elif temperature > 20:
    print("It's nice!")
else:
    print("It's cold!")
```

**Comparison operators** - compare values and return a boolean (`True` or `False`)
```python
# == (equal), != (not equal), > (greater), < (less), >= (greater or equal), <= (less or equal)
age = 18
is_adult = age >= 18  # Returns True or False
```

**Logical operators** - Combine multiple conditions
```python
# and, or, not
age = 25
has_license = True

if age >= 18 and has_license:
    print("Can drive!")
    
if age < 16 or not has_license:
    print("Cannot drive!")
```

**Membership operators** - Check if items exist in collections
```python
# in, not in
fruits = ["apple", "banana", "orange"]
if "apple" in fruits:
    print("We have apples!")
```

In [None]:
#try experimenting with conditional statements to control the flow of your code (e.g., if, elif, else).

### **Loops**
> ⚠️ Use loops when you want to do the same thing multiple times, or go through an iterable collection of data.

**For loops** - Iterate through known sequences
```python
# Loop through a list
fruits = ["apple", "banana", "orange"]
for fruit in fruits:
    print(f"I like {fruit}")

# Loop through numbers
for i in range(5):  # 0, 1, 2, 3, 4
    print(f"Count: {i}")
```

**While loops** - Repeat until a condition changes
```python
count = 0
while count < 3:
    print(f"Count is {count}")
    count += 1  # Same as: count = count + 1
```

In [None]:
#try experimenting with a simple loop, to modify an iterable (e.g., a list of strings) in some way (e.g., .upper(), .split(), etc.)

### **Flow Control**
> ⚠️ Use continue and break statements to 'skip' or 'exit' loops early under certain conditions.
```python
# break - exit the loop completely
# continue - skip to the next iteration

for i in range(10):
    if i == 3:
        continue  # Skip when i is 3
    if i == 7:
        break     # Stop when i is 7
    print(i)
```

In [None]:
#try to include flow controls into your loop from above.

## 4. Functions
> 💡 Functions are reusable blocks of code that perform specific tasks. You can define them once, then use them multiple times.

**Basic function** -
```python
def clean_text(text):                                        #...here we define a function called 'clean_text' that takes one argument, 'text'
    """Remove extra whitespace and convert to lowercase."""  #...this is called a 'docstring' and it explains what the function does
    cleaned = text.strip().lower()                           #...this line does the actual work:  .strip() whitespace and .lower() the text
    return cleaned                                           #...this line returns the cleaned text to whatever called the function

# Process a text
raw_text = "  HELLO WORLD!  "
cleaned_text = clean_text(raw_text)  # store the result under a NEW name, so we don't overwrite the function itself
print(cleaned_text)  # Output: hello world!
```

💡 Notice the indentation after the function definition. This tells Python what code belongs to the function.

⚠️ Anything that is not 'returned' from a function is lost when the function ends.
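
The difference between printing and returning can be sketched with two hypothetical functions that count the words in a text:

```python
def count_words_and_print(text):
    word_count = len(text.split())
    print(word_count)       # shows the value on screen, but does not hand it back

def count_words(text):
    word_count = len(text.split())
    return word_count       # hands the value back to whatever called the function

printed = count_words_and_print("hello big world")   # prints 3
returned = count_words("hello big world")

print(printed)    # None - nothing was returned, so there is nothing to store
print(returned)   # 3
```

A function without a `return` statement still returns something: the special value `None`.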

In [None]:
#try defining a function that takes two numbers as input and returns their sum.

#what happens if you do not include the return statement? How is this different from printing the result inside the function?

**Function with keyword arguments** -
```python
def preprocess_text(text, lowercase=True, remove_punctuation=False): #...this time, we take the text along with two optional keyword arguments.
    """Preprocess text with two options.
        - lowercase: Convert text to lowercase (default: True)
        - remove_punctuation: Remove punctuation from text (default: False)
    """
    if lowercase:                         #...if the 'lowercase' argument is True (default), we lowercase the text    
        text = text.lower()
    if remove_punctuation:                #...if the 'remove_punctuation' argument is True (default is False), we remove some common punctuation
        text = text.replace("!", "").replace("?", "").replace(".", "")
    return text

# Can call with keywords in any order
processed = preprocess_text("Hello World!", remove_punctuation=True, lowercase=True)
print(processed)  # Output: hello world
```
> 💡 Many functions take one or more required inputs, along with optional keyword arguments to customize their behavior. These should be documented in the docstring.
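
Here is one more sketch of how defaults and keyword arguments interact. The function `shorten` and its parameters are hypothetical, made up for this illustration.

```python
def shorten(text, max_length=10, suffix="..."):
    """Cut text down to max_length characters, adding a suffix if it was cut."""
    if len(text) <= max_length:
        return text
    return text[:max_length] + suffix

title = "Natural Language Processing"

print(shorten(title))                             # Natural La...  (both defaults used)
print(shorten(title, max_length=7))               # Natural...     (one default overridden)
print(shorten(title, suffix="~", max_length=7))   # Natural~       (keyword order is free)

print(shorten.__doc__)   # the docstring is stored on the function itself
```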

In [None]:
#What happens if you call the function with only the text argument? Why?

#What happens when you hover over the function name in a Jupyter notebook? Can you see the docstring we added? Try doing the same for print().

## 5. Errors
> 💡 Errors are a normal part of programming - learning to read the error message and the 'stack trace' (i.e., which part of the code broke) is important.

### **Reading Error Messages**
```python
# Error messages tell you:
# 1. What type of error occurred
# 2. Where it happened (line number)

text = "Natural Language Processing"
print(text[100])  # IndexError: string index out of range
#     ^^^^^^^^^
#     This points to exactly where the problem is!
```

In [None]:
#try to print a non-existent variable (e.g., print(nothing)). Identify the type of error you get, and where the error occurred.

### **Common Error Types**

**SyntaxError** - there is a problem with the structure of your code
```python
# e.g., missing colon
if True
    print("This will cause a SyntaxError")

# e.g., unmatched parentheses
print("Hello world!"
```

**NameError** - you are trying to use an object that hasn't been defined
```python
print(undefined_variable)  # NameError: name 'undefined_variable' is not defined
```

**TypeError** - you are trying to perform an operation on incompatible data types
```python
text = "Hello"
number = 5
result = text + number  # TypeError: can only concatenate str (not "int") to str
```

**IndexError** - you are trying to access an index that doesn't exist in a sequence (e.g., a list or string)
```python
words = ["apple", "banana"]
print(words[5])  # IndexError: list index out of range
```

**KeyError** - you are trying to access a dictionary key that doesn't exist
```python
person = {"name": "Alice", "age": 25}
print(person["height"])  # KeyError: 'height'
```

### **Handling Errors with Try/Except**
> 💡 try/except blocks allow you to 'try' code that you know might fail, and if it does - rather than crash - it will move to an 'except' block and run that instead.

```python
documents = ["Document 1", "Document 2"]

try:
    doc = documents[5]  # This will cause an IndexError
    processed = doc.lower()
except IndexError:
    print("Document index doesn't exist!")
    doc = "Default document"
    processed = doc.lower()

print(processed)
```
> ⚠️ While useful, it is best to avoid using try/except blocks unless absolutely necessary. This is because they 'swallow' errors that would otherwise help you identify problems in your code - problems that might become apparent later on, when your code is more complex and harder to debug. In other words - it is sometimes better to fix the root cause.
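
Two ways of acting on this advice are sketched below: checking before you act, and (when a try/except is warranted) catching only the specific error you expect. The variable names are hypothetical; `.get()` is a standard dictionary method.

```python
documents = ["Document 1", "Document 2"]
index = 5

# Option 1: fix the root cause by checking first, so no error can occur.
if index < len(documents):
    doc = documents[index]
else:
    doc = "Default document"
print(doc)

# Option 2: catch ONLY the error you expect, and keep its message visible.
try:
    doc = documents[index]
except IndexError as error:
    print(f"Could not load document {index}: {error}")
    doc = "Default document"

# For dictionaries, .get() returns a fallback instead of raising a KeyError.
person = {"name": "Alice", "age": 25}
height = person.get("height", "unknown")
print(height)
```

Catching a specific error type (rather than a bare `except:`) means that unexpected problems still surface, which keeps debugging manageable.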

## 6. BONUS - Advanced Python Features
> 💡 These are more advanced techniques that will make your code more concise and efficient.

### **List Comprehensions**
> 💡 Essentially: a more compact version of a loop that creates or modifies lists. Very useful for text processing!

> ⚠️ Most useful for simple transformations. For more complex logic, use regular loops as they are easier to read.


In [None]:
#Imagine, you have a list of words, and you want to create a new list with all the words in uppercase.
#...you could do this with a for loop (and it works fine):
words = ["hello", "world", "python", "nlp"]
uppercase_words = []
for word in words:
    uppercase_words.append(word.upper())
print(uppercase_words)

In [None]:
#or you could use a list comprehension:
uppercase_words = [word.upper() for word in words]
print(uppercase_words)

#This is equivalent to the for loop above, but in a single line of code!

#Some further examples -

# Clean and normalize strings:
messy_texts = ["  Hello  ", "WORLD!", "  python  "]
clean_texts = [text.strip().lower() for text in messy_texts]
print(clean_texts)  # ['hello', 'world!', 'python']

# Get only long words (>4 characters) and make them uppercase
words = ["cat", "python", "dog", "machine", "learning"]
long_words = [word.upper() for word in words if len(word) > 4]
print(long_words)

### **Dictionary Comprehensions**
> 💡 These work just like list comprehensions, but create dictionaries instead (i.e., each item they produce is a `key: value` pair).


In [None]:
#Imagine, you have a list of words, and you want to create a dictionary that maps each word to its length.
#...you could do this with a for loop (and it works fine):
words = ["apple", "banana", "cherry"]
word_lengths = {}
for word in words:
    word_lengths[word] = len(word)
print(word_lengths)

#...or you could use a dictionary comprehension:
word_lengths = {word: len(word) for word in words}
print(word_lengths)

#Some further examples -

# Create word frequency mapping
text = "python is great python is powerful"
words = text.split()
word_freq = {word: words.count(word) for word in set(words)}
print(word_freq)

# Convert text to uppercase keys with original as values
texts = ["Hello", "World", "Python"]
text_mapping = {text.upper(): text for text in texts}
print(text_mapping)

### **Generators**
> 💡 Generators create items on-demand instead of storing everything in memory. Great for processing large datasets!

**Basic generator** - Use `yield` instead of `return`
```python
def text_processor():
    """Generator that yields processed text one at a time."""
    texts = ["  HELLO  ", "  WORLD  ", "  PYTHON  "]
    for text in texts:
        yield text.strip().lower()

# Use the generator
processor = text_processor()
for clean_text in processor:
    print(clean_text)
# Output: hello, world, python (one at a time)
```
> ⚠️ While generators are very memory efficient, they can only be iterated once. After that, they are 'exhausted' and you need to create a new one.
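
What 'exhausted' means in practice can be sketched with a small, hypothetical generator:

```python
def count_up_to(limit):
    """Yield the numbers 1, 2, ..., limit one at a time."""
    number = 1
    while number <= limit:
        yield number
        number += 1

counter = count_up_to(3)
first_pass = list(counter)     # consumes every item the generator produces
second_pass = list(counter)    # nothing is left: the generator is exhausted

print(first_pass)              # [1, 2, 3]
print(second_pass)             # []

fresh_pass = list(count_up_to(3))   # calling the function again creates a new generator
print(fresh_pass)                   # [1, 2, 3]

# A 'generator expression' looks like a list comprehension with round brackets:
squares = (n * n for n in range(4))
print(next(squares))           # 0 - next() asks for one item at a time
print(sum(squares))            # 14 - the remaining items: 1 + 4 + 9
```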
