# 📘 Introduction to Python Programming and Pandas (Pre-Tutorial Guide)

Welcome to the Data Science course!

This notebook is designed to give you a **gentle introduction to Python programming** and an **overview of basic data analysis using Pandas**. It is intended to help you get started with the tools and concepts you’ll be using throughout the course.


## 🗓️ When to Complete This

You are expected to go through and understand this notebook **during the first week of the course, i.e. 7-11 April**. During the first tutorial on 15th April, you’ll solve hands-on exercises from another tutorial that build on the concepts introduced in this notebook and in the assigned DataCamp sessions.

This notebook is **not part of the formal teaching time** and contains **no exercises** or assessments. It is meant to be worked through at home during the first week of the course.


## 📚 What This Notebook Covers

- Basic Python syntax (variables, data types, conditionals, loops, functions)
- Working with lists and dictionaries
- Reading and exploring data with Pandas
- Accessing and querying data with Pandas

No prior programming experience is required — this notebook is beginner-friendly and interactive.

## ✅ What You Need to Do

1. **Work through this notebook at your own pace.** Try out the code examples, and make sure you understand what each line is doing.
2. **Complete the assigned DataCamp sessions** listed on Canvas. These will reinforce what you learn here.
3. **Bring your questions to the lecture or tutorial session!** The tutorial will focus on solving problems using what you’ve learned, so make sure you’re familiar with the basics beforehand.

> 💡 Tip: Feel free to experiment with the code as you go! Modify examples, try new things, and make mistakes — it’s the best way to learn.

<br>
<br>


**Note that programming is learned differently from many other disciplines or courses. You do not learn it by reading books and expecting to know everything, but by putting it into practice. It is a highly iterative learning process. Even the most experienced programmers get stuck all the time and then consult the documentation of the libraries they use (e.g., `pandas`), turn to online communities like `StackOverflow`, or ask `ChatGPT`. When you use such online resources, MAKE SURE THAT YOU UNDERSTAND WHY YOU DO WHAT YOU DO. For example, when using ChatGPT, ask it to explain its output to you. If you don't do this, you will learn very little from this course.**

# Using Notebooks

Notebooks are a great way to write and share code. They are interactive documents that combine code, text, images, and other media. In this course, we will use notebooks to write our code and share our results.

A notebook is made up of cells. Each cell can be either a code cell or a text cell. A code cell contains code that can be executed. A text cell contains text that can be formatted using [Markdown](https://colab.research.google.com/notebooks/markdown_guide.ipynb). You can run a cell by pressing `Shift+Enter` or by clicking the `Run` button next to the cell.

This is a text cell. You can edit it by double-clicking on it. You can format the text using Markdown. For example, you can make text bold or italic. You can also create headings, lists, links, images, and more. 

**You can find a Markdown tutorial here: https://colab.research.google.com/notebooks/markdown_guide.ipynb** 


In [None]:
# This is a code cell with a comment. Comments are lines that start with a # and are ignored by the Python interpreter.
print("Hello, World!")

After you run a code cell, the output will be displayed below it. Once you save the notebook, the output will be saved as well. This is a great way of sharing your results with others!

You can clear the output by right-clicking the output cell and selecting `Clear output`.

## Python Basics: Variables, Data Types and Operators

In Python, we can store data in variables. A variable is a name that refers to a value, and values can be of different types, such as integers, floats, and strings. For example, we can create a variable called `price` and assign it the value `15`:

In [None]:
price = 15

Variables in Python are dynamically typed. This means that we don't need to specify the type of the variable when we create it. Python will automatically infer the type of the variable based on the value that we assign to it. For example, if we assign the value `15` to the variable `price`, Python will infer that the type of the variable is `int` (integer). If we assign the value `15.5` to the variable `price`, Python will infer that the type of the variable is `float` (floating point number).

In [None]:
# Integers are numbers without a decimal point.
price = 15
print(type(price))

# Floats are numbers with a decimal point.
price = 15.5
print(type(price))

Python's most commonly used built-in data types are numbers, strings, booleans, lists, tuples, and dictionaries.

Numeric data types are used to store numbers. The two numeric types used in this notebook are integers and floats. Integers are whole numbers, while floats are numbers with a decimal point.

You can perform many operations with numeric variables: Addition, Subtraction, Multiplication, Power, Division, and Modulo on Integers and Floats.

Let's write an example for each of these operations in the following cells, using a new variable called `amount`:

In [None]:
amount = 20

In [None]:
# Addition
print(amount + 10)

# Subtraction
print(amount - 10)

# Multiplication
print(amount * 10)

# Division
print(amount / 10)

# Power
print(amount ** 2)

# Modulo
print(amount % 10)

## Strings

Strings are used to store text. Strings are created by enclosing characters in quotes. For example, we can create a string variable called `name` and assign it the value `"John"`:

In [None]:
name = "John"
print(name)

You can use both single quotes (`'`) and double quotes (`"`) to create strings. However, you must use the same type of quotes to start and end the string. For example, the following code will produce an error:

In [None]:
name = "John'

In [None]:
# Using quotes in a string
print("John's bill is", price)


Enclosing the string in double quotes lets you use a single quote (apostrophe) inside it, as in the cell above. The following code, however, will produce an error. Can you explain why this error occurs?

In [None]:
# What happens if you use single quotes instead?
print('John's bill is', price)

**Strings are sequences of characters. You can access individual characters using the square brackets (`[]`) operator. Note that the first character has index `0`:**

In [None]:
# Printing the first character of a string
print(name[0])
# Printing the last character of a string
print(name[-1])

You can also access substrings using the square brackets (`[]`) operator. The syntax is `string[start:end]`, where `start` is the index of the first character in the substring and `end` is the index of the first character after the substring. This operation is called **slicing**. Note that the character at index `end` is not included in the substring:

In [None]:
full_name = "Marcus Aurelius Antoninus"
print("The first name is:", full_name[0:6])
# Now use slicing to print the middle and last name
print("The middle name is:", full_name[7:15])
print("The last name is:", full_name[16:])
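As a small additional sketch of slicing (the variable `word` is made up for illustration), the start or end of a slice can be left out, and negative indices count from the end of the string:

```python
word = "Python"

# Leaving out the start means "from the beginning"
first_two = word[:2]
# Leaving out the end means "up to and including the last character"
from_third = word[2:]
# Negative indices count from the end of the string
last_three = word[-3:]

print(first_two)   # Py
print(from_third)  # thon
print(last_three)  # hon
```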

## Booleans

Booleans are used to store logical values. In Python, booleans have the type `bool`. Booleans are created by using the `True` and `False` keywords:

In [None]:
a = True
print(type(a))

Booleans are the result of logical operations. For example, you can use the `==` operator to check if two variables are equal.

Booleans can also be used in conditional statements. We will cover this later.

Now let's run the following statements and check whether the output for each condition matches what you expect.

In [None]:
print(name == "John")
print(price > 10)
print(1 != 1)

## Lists

Lists are used to store a sequence of values. Lists are created by enclosing values in square brackets (`[]`). For example, we can create a list variable called `prices` and assign it the values `[15, 20, 25]`:


In [None]:
prices = [15, 20, 25]
print(prices)

Lists can also store variables and values of different types. Let's create a list called `x`, which contains variables and values of different data types.

In [None]:
x = [10, name, "abc", True, 3.14]
print(x)

**You can access elements of a list using the square brackets (`[]`) operator. Note that the first element has index `0`. Slicing also works with lists. What do the following two statements do?**

In [None]:
print(x[-1])
print(x[1:3])

You can use the `append()` method to add elements to the list. For instance, if we want to add the integer `100` to the list in the last position, then we will use the following statement.

In [None]:
x.append(100)
print(x)

You can remove elements from a list using the `remove()` method. Let's remove the element `"abc"` from the list `x`.

In [None]:
x.remove("abc")
print(x)

To add an element at a specific position in a list, you can use the `insert()` method. For instance, let's insert the element `"abc"` into the list again at position `2` (remember that indices in Python always start from 0, so position 2 is actually the `3rd` position).

In [None]:
x.insert(2, "abc")
print(x)

You can change the value of an element in a list by using the square brackets (`[]`) operator. Let's change the value of the first element (`0`th position) to `-100`.

In [None]:
x[0] = -100
print(x)

So, now `10` is replaced by `-100`.

## Dictionaries

Dictionaries are used to store key-value pairs. Dictionaries are created by enclosing key-value pairs in curly brackets (`{}`). For example, we can create a dictionary variable called `menu` and assign it the following key-value pairs. These could indicate the price of each food item in the `menu`. 

In [None]:
menu = {"pizza": 15, "pasta": 20, "salad": 10}
print(menu)

You can access the value of a key using the square brackets (`[]`) operator. If the key does not exist, you will get an error:

In [None]:
print(menu["pizza"])  # prints the value (price) of the food item, i.e. pizza

In [None]:
print(menu["kebab"]) # can you explain why you get an error when you run this?

You can add new key-value pairs to a dictionary using the square brackets (`[]`) operator. If the key already exists, the value will be updated.

In [None]:
menu["kebab"] = 25
menu["pasta"] = 25
print(menu)
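If you are not sure whether a key exists, you can check before accessing it. The sketch below uses the `in` operator and the dictionary method `get()`, neither of which is introduced above. The dictionary `example_menu` and the fallback value `0` are made up for illustration.

```python
example_menu = {"pizza": 15, "pasta": 20, "salad": 10}

# `in` checks whether a key exists, without raising an error
has_kebab = "kebab" in example_menu
print(has_kebab)

# get() returns the value for a key, or a fallback value if the key is missing
kebab_price = example_menu.get("kebab", 0)
pizza_price = example_menu.get("pizza", 0)
print(kebab_price, pizza_price)
```

Whether a fallback value makes sense depends on the situation. Sometimes an error for a missing key is exactly what you want, because it points you to a mistake.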

## Introduction to Functions

Functions allow you to package code for reuse. 

## Built-in Functions

Built-in functions are functions that are already defined in Python. For example, the `print()` function is a built-in function. You can find a list of all built-in functions in the [Python documentation](https://docs.python.org/3/library/functions.html).

Let's check out some of the built-in functions in Python.

**TIP**: You can use the `help()` function to get more information about a function. For example, you can use `help(print)` to get more information about the `print()` function.

In [None]:
help(print)

### Type conversion functions

Python has several built-in functions to convert data types. For example, you can use the `int()` function to convert a float to an integer. You can use the `str()` function to convert a number to a string. 

In [None]:
price = 15
print(float(price))

In [None]:
price = 18.75
print(int(price)) #convert price to integer
print(str(price)) #convert price to string
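One detail worth checking yourself: `int()` cuts off the decimal part rather than rounding. The sketch below compares it with the built-in `round()` and also converts a string back to a number. The variable names are illustrative only.

```python
price_float = 18.75

truncated = int(price_float)    # drops the decimal part
rounded = round(price_float)    # rounds to the nearest integer

# Strings that contain a number can be converted back with float() or int()
price_text = "18.75"
price_again = float(price_text)

print(truncated)     # 18
print(rounded)       # 19
print(price_again)   # 18.75
```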

### Sequence functions

Python has several built-in functions to work with sequences. For example, you can use the `len()` function to get the length of a sequence.

`sorted()` sorts a sequence. 

`max()` and `min()` can be used to get the maximum and minimum values in a sequence, while `sum()` can be used to get the sum of all the elements in a sequence.

Let's create a list called `prices` and run some sequence functions on it and check the outputs.

In [None]:
prices = [15, 5, 25, 30, 8, 50, 6, 14, 63, 10]

In [None]:
print(len(prices)) # length of the list
print(max(prices)) # maximum value in the list
print(min(prices)) # minimum value in the list
print(sum(prices)) # sum of all elements in the list
print(sorted(prices)) # sorting the list in ascending order
print(sorted(prices, reverse=True)) # sorting the list in descending order
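Note that `sorted()` returns a new list and leaves the original list unchanged. A minimal sketch, using a made-up list `sample_prices`, which also combines `sum()` and `len()` to compute an average:

```python
sample_prices = [15, 5, 25]

ascending = sorted(sample_prices)

print(ascending)       # the new, sorted list: [5, 15, 25]
print(sample_prices)   # the original list keeps its order: [15, 5, 25]

# The average is the sum of the elements divided by their number
average = sum(sample_prices) / len(sample_prices)
print(average)         # 15.0
```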

## Libraries

Libraries are collections of functions that extend the functionality of Python. For example, the `math` library contains functions to perform mathematical operations. You can import a library using the `import` keyword. For example, we can import the `math` library using the following code:

In [None]:
import math

print(math.pi)

You can import specific functions from a library using the `from` keyword. For example, we can import the `sqrt()` function from the `math` library using the following code:

In [None]:
from math import sqrt

print(sqrt(25))

## User-defined Functions

Python allows you to create your own functions using the `def` keyword. Functions help organize code into reusable blocks, making your code easier to read and maintain.

To return a value from a function, use the `return` keyword. If a function does not explicitly return a value, it will automatically return `None`.

Even without a `return` statement, a function can still perform actions or computations internally.


### Example

Let's define a simple function named `square` that returns the square of a given number:

In [None]:
# Defining a simple function using def
def square(x):   # this is a function definition
    return x * x

### Important: Function Call

Defining a function alone does not execute it. You need to explicitly call the function by writing its name followed by parentheses ( ) and passing any required arguments:


In [None]:
square(5)  # this is a function call where the function gets executed
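To illustrate the earlier point about `return`, the sketch below compares a function that returns its result with a hypothetical function `print_square` that only prints it. Both function names are made up for this example.

```python
def square_value(x):
    # Returns the result, so the caller can store or reuse it
    return x * x

def print_square(x):
    # Only prints the result; there is no return statement
    print(x * x)

result_a = square_value(5)
result_b = print_square(5)   # prints 25

print(result_a)   # 25
print(result_b)   # None
```

This is why a function that should feed its result into further calculations needs a `return` statement, not just a `print()`.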

### Lambda Functions

Lambda functions (also called anonymous functions) in Python are small, inline functions defined using the `lambda` keyword. Unlike functions defined using `def`, lambda functions don't have a name of their own and consist of a single expression.

The syntax of a `lambda` function is as follows:

`lambda arguments: expression`

In [None]:
# Lambda function example to square a number
square_lambda = lambda x: x * x          # function definition

square_lambda(6)  # function call
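Lambda functions are often passed directly to other functions. As a sketch, the built-in `sorted()` accepts a `key` argument that decides what to sort by. The list `dishes` is invented for illustration.

```python
# Each element is a (name, price) pair
dishes = [("pizza", 15), ("pasta", 20), ("salad", 10)]

# Sort by the second entry of each pair (the price)
by_price = sorted(dishes, key=lambda dish: dish[1])

print(by_price)
```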

## Conditionals

Conditionals are used to execute code only if a certain condition is true. In Python, we use the `if` statement to create conditionals.

In [None]:
price = 15

if price > 10:
    print("The price is greater than 10")

You can have multiple conditions in an `if` statement. You can use the `elif` keyword to add more conditions. You can use the `else` keyword to execute code if none of the conditions are true.

In [None]:
price = 10

if price > 10:
    print("The price is greater than 10")
elif price < 10:
    print("The price is less than 10")
else:
    print("The price is exactly 10")
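Conditions are checked from top to bottom, and only the first matching branch runs. The sketch below wraps an `if`/`elif`/`else` chain in a hypothetical function `describe_price`. The thresholds 10 and 20 and the labels are arbitrary choices for this example.

```python
def describe_price(price):
    # The conditions are checked in order; the first one that is true wins
    if price > 20:
        return "expensive"
    elif price > 10:
        return "moderate"
    else:
        return "cheap"

print(describe_price(25))   # expensive
print(describe_price(15))   # moderate
print(describe_price(5))    # cheap
```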

## Loops

Loops are used to execute a block of code multiple times. `for` loops are used to iterate over a sequence (list, tuple, string, dictionary, set, or range). For example, we can use a `for` loop to print all the elements in a list:

In [None]:

list_l1 = [-100, 'John', 'abc', True, 3.14, 100]


for k in list_l1:
    print(k)



In this example:

- We have a list `list_l1` containing elements of different data types: integers (-100, 100), strings ('John', 'abc'), a boolean (True), and a float (3.14).

- The `for` loop iterates over each element in the list.

- During each iteration, the current element is assigned to the loop variable `k`, and `print(k)` outputs the current value.


Similarly, we can use a `for` loop to iterate over dictionaries. See the example below and then the explanation.

In [None]:
dict_menu = {"pizza": 15, "pasta": 20, "salad": 10}
for item, price in dict_menu.items():
    print(item, price)


In this example:

- We have a dictionary named `dict_menu`, where each key is a food item (like "pizza") and its corresponding value is the price.

- The method `items()` is used to iterate over key-value pairs.

- In each iteration, the loop assigns the key to `item` and the value to `price`.

- `print(item, price)` displays each food `item` (key) and its `price` (value).
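Loops are often combined with a variable that accumulates a result. The following sketch sums the prices in a dictionary and also shows `range()`, which was mentioned above as something a `for` loop can iterate over. The names `example_menu`, `total`, and `numbers` are illustrative.

```python
example_menu = {"pizza": 15, "pasta": 20, "salad": 10}

# Add up all prices, one iteration at a time
total = 0
for item, price in example_menu.items():
    total = total + price
print(total)      # 45

# range(3) produces the numbers 0, 1, 2
numbers = []
for i in range(3):
    numbers.append(i)
print(numbers)    # [0, 1, 2]
```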

# Getting Started with Pandas

**Pandas** is not a built-in Python library, so you'll need to install it before using it. If you're using **Google Colab**, Pandas comes pre-installed. However, if you're working on your local computer, you can install it using Python's package manager, `pip`.

Run the following command directly in your notebook cell to install Pandas:

```python
!pip install pandas
```

- The `!` character tells the notebook to run the rest of the line as a terminal (shell) command instead of as Python code.
- `pip` is Python's standard package installer, commonly used for installing Python libraries. Typically, you install a library with the command `pip install <package_name>`.

Once Pandas is installed, you can import it in your notebook as follows:

```python
import pandas as pd
```

Here, `pd` is the standard alias for Pandas, commonly used because it shortens the code and clearly identifies Pandas functions and methods within your scripts.


## Basic Definitions

The two primary data structures in **Pandas** are: **Series** and **DataFrame**.

### Series

A `Series` is a one-dimensional labeled array capable of holding any data type (integers, strings, floats, Python objects, etc.). You can think of it as a single column in a spreadsheet or table.


```python
pandas.Series(data=None, index=None, ...)
```

- **`data`**: The data to store in the Series (e.g., a list, array, or scalar value).
- **`index`**: Optional labels for each element in the Series. If not specified, Pandas will automatically assign a numeric index starting from 0.

Example:
```python
import pandas as pd

data = [10, 20, 30]
labels = ['a', 'b', 'c']

series = pd.Series(data, index=labels)
print(series)
```

**Output:**
```
a    10
b    20
c    30
dtype: int64
```

In [None]:
import pandas as pd

data = [10, 20, 30]
labels = ['a', 'b', 'c']

series = pd.Series(data, index=labels)
print(series)
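Elements of a Series can be accessed by their label, much like values in a dictionary. A small sketch, assuming Pandas is installed (the names `example_series` and `default_series` are made up):

```python
import pandas as pd

example_series = pd.Series([10, 20, 30], index=["a", "b", "c"])

# Access one element by its label
value_b = example_series["b"]
print(value_b)

# Without an explicit index, Pandas numbers the elements 0, 1, 2, ...
default_series = pd.Series([10, 20, 30])
print(list(default_series.index))
```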

### DataFrame

A `DataFrame` is a two-dimensional, tabular data structure made up of multiple `Series` that share a common index. You can think of it like an Excel spreadsheet or SQL table.

### Example: Creating a DataFrame from a Dictionary

Let's say we run a fruit stand that sells **apples** and **oranges**. Each row in our data will represent a customer's purchase — how many apples and oranges they bought.

We can organize this data as a dictionary where:
- Each **key** represents a fruit (a column in the table).
- Each **value** is a list of quantities sold to each customer.

Here's how we can create a DataFrame from that dictionary:

```python
import pandas as pd

# Dictionary where keys are column names and values are lists of data
fruit_data = {
    "apples": [3, 2, 0, 1], 
    "oranges": [0, 3, 7, 2]
}

# Create DataFrame and assign custom row labels for each customer
purchases = pd.DataFrame(fruit_data, index=["Customer 1", "Customer 2", "Customer 3", "Customer 4"])

print(purchases)
```

**Output:**
```
            apples  oranges
Customer 1       3        0
Customer 2       2        3
Customer 3       0        7
Customer 4       1        2
```

This table (DataFrame) clearly shows how many apples and oranges each customer bought, and it's now in a format that's easy to analyze using Pandas.


In [None]:
fruit_data = {
    "apples": [3, 2, 0, 1], 
    "oranges": [0, 3, 7, 2]
}

purchases = pd.DataFrame(fruit_data, index=["Customer 1", "Customer 2", "Customer 3", "Customer 4"])

purchases

### How Did That Work?

When creating a `DataFrame` from a dictionary, each **key** becomes a **column name**, and each **value** (usually a list or array) becomes the **column data**.

Here’s the general constructor for a DataFrame:

```python
pandas.DataFrame(data=None, index=None, columns=None, dtype=None, ...)
```

#### Explanation of Parameters:

- **`data`**: The input data to populate the DataFrame. It can be:
  - A dictionary (like in our fruit example)
  - A list of lists
  - A NumPy array
  - Another DataFrame

- **`index`**: (Optional) Custom row labels. If not provided, Pandas uses default integer indices starting from 0.

- **`columns`**: (Optional) Column labels. If you're passing a dictionary (which already has keys as column names), you typically don't need to set this. 

- **`dtype`**: (Optional) The data type to force for all columns. Use this if you want to convert all data to a specific type (e.g., `float`, `int`, `str`).

This flexibility is one of the reasons Pandas is so powerful for working with structured data.
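To see the `columns` and `dtype` parameters in action, here is a sketch that builds a similar table from a list of lists instead of a dictionary. The variable names and the choice of `float` are assumptions made for illustration.

```python
import pandas as pd

# Each inner list is one row; the column names are given separately
rows = [[3, 0], [2, 3], [0, 7]]

purchases_from_rows = pd.DataFrame(
    rows,
    index=["Customer 1", "Customer 2", "Customer 3"],
    columns=["apples", "oranges"],
    dtype=float,   # store all values as floats
)

print(purchases_from_rows)
print(purchases_from_rows.dtypes)
```

With a list of lists there are no dictionary keys to take the column names from, so `columns` has to be passed explicitly if you want meaningful names.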


# Loading DataFrames

Often, you won't be creating DataFrames from scratch. Instead, you will be loading them from files. Pandas can read a variety of file types using its `pd.read_` functions. CSV files are one of the most common, so we will start there.


We will be working with the IMDB Movie Dataset, which has already been uploaded to a GitHub repository at this URL: https://raw.githubusercontent.com/rnanda17/data_science_BE/refs/heads/main/IMDB-Movie-Data.csv . Pandas can read a CSV file directly from a public URL and load it as a DataFrame in the notebook. Note that the URL must point to the raw file, not to the GitHub page that displays it.

## Loading CSV Files

CSV stands for "comma-separated values". CSV files are a common way to store tabular data. They are plain text files with a specific structure. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The first line of the file usually contains the names of each column.

With CSV files, all you need is a single line of code:
`pandas.read_csv(filepath)`

There are many other arguments you can pass to `read_csv`, but we will only use the `filepath` for now.

Use the help function to learn more about `read_csv`:

```help(pd.read_csv)``` or ```pd.read_csv?```

In [None]:
# By default, index_col=None, so Pandas adds its own integer index starting from 0
movies_df = pd.read_csv("https://raw.githubusercontent.com/rnanda17/data_science_BE/refs/heads/main/IMDB-Movie-Data.csv")
movies_df


In [None]:
# If we set index_col=0, we're explicitly stating to treat the first column as the index:
movies_df = pd.read_csv("https://raw.githubusercontent.com/rnanda17/data_science_BE/refs/heads/main/IMDB-Movie-Data.csv", index_col=0)

movies_df
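The effect of `index_col` is easier to see on a tiny example. The sketch below reads a made-up CSV from a string (using `io.StringIO` from the standard library) so that it runs without a file or an internet connection. The data and variable names are invented.

```python
import io
import pandas as pd

# A tiny CSV, held in a string instead of a file (made-up data)
csv_text = """Rank,Title,Year
1,Movie A,2014
2,Movie B,2012
"""

# Without index_col, Pandas adds its own index 0, 1, ...
default_df = pd.read_csv(io.StringIO(csv_text))
print(default_df)

# With index_col=0, the first column ("Rank") becomes the index
ranked_df = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(ranked_df)
```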

### Note about other ways to work with CSV files with Python notebooks

You can also work with a local copy of the dataset. Download the file `IMDB-Movie-Data.csv` from Canvas or from the URL: https://github.com/rnanda17/data_science_BE/blob/main/IMDB-Movie-Data.csv.

If you are using Colab, upload the file to your notebook by clicking on the folder icon on the left side of the screen. Click on the `Upload` button and select the file. You should then see the file in the file explorer on the left side of the screen. If you right-click on the file, you can copy its path by clicking on `Copy Path`.

You can also sync your Google Drive with Colab. To do this, click on the folder icon on the left side of the screen. Click on the `Mount Drive` button. You will be prompted to authenticate your Google account. Once you do that, you will be able to access your Google Drive files from Colab. You can upload the file to your Google Drive and then access it from Colab.

If you are using a local Jupyter notebook, you can just save the file in the same directory as your notebook.

# Exploring your DataFrame

Now let's learn some ways to explore your DataFrame. First, let's see some methods for checking the data within the DataFrame.

## Accessing Data

`DataFrame.head(n)` returns the first n rows of the DataFrame.

In [None]:
movies_df.head(5)

`DataFrame.tail(n)` returns the last n rows of the DataFrame.

In [None]:
movies_df.tail(5)

### Accessing Columns

You can access a specific column by using the following syntax:

In [None]:
movies_df["Title"]
# Or
movies_df.Title

The square brackets + string with column name syntax works for any column name.

The `.` notation only works if the column name is a valid Python variable name. For instance, `df.Movie Title` will not work, but `df.movie_title` will. 

To make sure your code always works, you can use the first one with square brackets + string syntax.
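A short sketch of the difference, using a made-up DataFrame `toy_df` with one column name that contains a space:

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Movie Title": ["Movie A", "Movie B"],
    "Year": [2014, 2012],
})

# Square brackets work for any column name, including names with spaces
titles = toy_df["Movie Title"]
print(titles)

# Dot notation works here because "Year" is a valid Python variable name
years = toy_df.Year
print(years)
```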

### Accessing Rows

You can access a specific row in two ways:

- `df.loc` - locates by index name (row label). In our case, the index is the ranking of the movie.
- `df.iloc` - locates by numerical index (row number).

Note that `df.loc` and `df.iloc` are not methods, but attributes. This means that you don't use parentheses to call them. You just use them like this: `df.loc[1]` or `df.iloc[1]`.

The arguments for `df.loc` can also be a list of indices. For example, `df.loc[[1, 2, 3]]` will return the top three ranked movies.

Remember that DataFrame indices can also be strings.

In [None]:
# Returns the movie with Rank 1
movies_df.loc[1]

In [None]:
# Returns the movie in the first row
movies_df.iloc[0]

`df.loc` and `df.iloc` also work for accessing columns. For example, `df.loc[1, 'Title']` will return the title of the movie with index 1.

You can access all rows for a specific column by using `:` as the first argument. For example, `df.loc[:, 'Title']` will return all the titles.
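The difference between labels and positions can be confusing at first. The sketch below uses a made-up DataFrame `toy_df` whose row labels start at 1, similar to the ranking index of the movie data.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {"Title": ["Movie A", "Movie B", "Movie C"], "Year": [2014, 2012, 2016]},
    index=[1, 2, 3],   # row labels, like the ranking in the movie data
)

# loc uses labels: row label 1 is the first row here
by_label = toy_df.loc[1, "Title"]

# iloc uses positions: row position 1 is the second row, column position 0 is "Title"
by_position = toy_df.iloc[1, 0]

print(by_label)      # Movie A
print(by_position)   # Movie B

# All rows of one column
all_titles = toy_df.loc[:, "Title"]
print(all_titles)
```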

## Conditional Selection

Pandas makes it easy to select rows based on a condition. For example, if we want to select all the movies with a rating of 8.5 or higher, we can do the following:

In [None]:
movies_df[movies_df["Rating"] >= 8.5]

You can combine multiple conditions using the operators `&` (and) and `|` (or). Each condition must be wrapped in its own parentheses. For example, if we want to select all the movies with a rating of 8.5 or higher that came out after 2009, we can do the following:

In [None]:
movies_df[(movies_df["Rating"] >= 8.5) & (movies_df["Year"] >= 2010)]
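To see what happens behind the scenes, the sketch below uses a small made-up DataFrame `toy_df`. The comparison itself produces a Series of `True`/`False` values, and the DataFrame keeps only the rows where that value is `True`.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C"],
    "Rating": [8.6, 7.0, 9.0],
    "Year": [2008, 2015, 2012],
})

# The comparison produces one True/False value per row
mask = toy_df["Rating"] >= 8.5
print(mask)

# | means "or": rows that satisfy at least one of the conditions
either = toy_df[(toy_df["Rating"] >= 8.5) | (toy_df["Year"] >= 2010)]
print(either)

# & means "and": rows that satisfy both conditions
both = toy_df[(toy_df["Rating"] >= 8.5) & (toy_df["Year"] >= 2010)]
print(both)
```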

### Querying

You can also use the `DataFrame.query` method to select rows based on conditions. For example, if we want to select all the movies with a rating of 8.5 or higher that were released after 2010, we can do the following:

In [None]:
movies_df.query("Rating >= 8.5 and Year > 2010")

Both methods work the same way. The only difference is that `DataFrame.query` is more convenient to use when you have a lot of conditions because the syntax is more compact.

You can find a nice list of query examples at https://sparkbyexamples.com/pandas/pandas-dataframe-query-examples/ .
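As a sketch of how the two approaches relate, the following compares a boolean selection with the equivalent `query` call on a made-up DataFrame `toy_df`. It also shows that `query` can refer to a Python variable by prefixing its name with `@`. The variable `min_rating` is invented for this example.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C"],
    "Rating": [8.6, 7.0, 9.0],
    "Year": [2008, 2015, 2012],
})

# The same selection written in both styles
with_mask = toy_df[(toy_df["Rating"] >= 8.5) & (toy_df["Year"] > 2010)]
with_query = toy_df.query("Rating >= 8.5 and Year > 2010")
print(with_mask.equals(with_query))   # True

# Inside a query string, @name refers to a Python variable
min_rating = 8.5
high_rated = toy_df.query("Rating >= @min_rating")
print(high_rated)
```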


## Counting

There are different levels of counting we can do with Pandas. 

We can count the number of rows in a DataFrame, the number of unique values in a column, the number of times a specific value appears in a column, and even count values within a group.

### Counting Rows

The simplest way to count the number of rows in a DataFrame is to use the `len()` function. This function is not specific to DataFrames: it also works on other Python objects that have a length, such as lists, strings, and dictionaries.

We can also use the `shape` attribute of a DataFrame, which returns a tuple with the number of rows and columns in the DataFrame.

In [None]:
print(len(movies_df))
print(movies_df.shape)
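Because `shape` is a tuple, its two entries can be unpacked into separate variables. The sketch below, using a made-up DataFrame `toy_df`, also counts the rows that match a condition by combining `len()` with conditional selection.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C"],
    "Rating": [8.6, 7.0, 9.0],
})

n_rows = len(toy_df)
n_rows_from_shape, n_columns = toy_df.shape   # unpack the (rows, columns) tuple

print(n_rows, n_rows_from_shape, n_columns)   # 3 3 2

# Count only the rows that match a condition
n_high_rated = len(toy_df[toy_df["Rating"] >= 8.5])
print(n_high_rated)                           # 2
```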