# Python for Data Science

**Chapter 1: Getting Started in Python**

- Instructor: [Alier Reng](https://alierwai.org/about/), Alierwai DataStudio
---

## Introduction
Data science, a multidisciplinary field that focuses on extracting insights from data, relies heavily on Python, R, and other programming languages. Python is the primary language used in this course due to its simplicity and robust libraries. This introductory data science course with Python will cover fundamental concepts, data manipulation with Pandas, and data visualization using Matplotlib and Seaborn. Learners will delve into data cleaning, preprocessing, and exploratory data analysis to prepare data for analysis. The course equips learners with a solid foundation to embark on a data science journey and apply their skills to real-world challenges.

Upon completing this course, learners will be proficient in performing analyst tasks such as importing data with pandas, handling missing values, tidying and transforming data for further exploratory analysis, and effectively communicating their results to non-technical audiences.

## Chapter 1 Objective

Upon completing this chapter, learners will be proficient in working with Python data structures and performing fundamental operations, including for loops, set operations, and writing robust functions. They will also be comfortable working in VS Code, Jupyter Notebook, or JupyterLab.

## 1. Setup

### 1.1 Installing Anaconda

Instructions will be added at a later date.

### 1.2 Installing and Customizing VS Code

To enable Python and Quarto, the following extensions are required:

**2.1 IntelliSense (Pylance)**:

A Visual Studio Code extension that offers comprehensive support for the Python language (for all actively supported versions: >=3.7). It includes features like IntelliSense (Pylance), linting, debugging, code navigation, code formatting, refactoring, variable explorer, test explorer, and more.

**2.2 Python Indent**

Ensures correct Python indentation in Visual Studio Code. You can find the extension on the VS Code Marketplace and review its source code on GitHub.

**2.3 Python Extension Pack**

A collection of Python extensions, including autodocstring, IntelliSense, Jinja, Django, Intellicode, and Python Environment Manager.

**2.4 Python Environment Manager**

Offers seamless management of Python environments.

**2.5 Quarto Extension**

Quarto is an open-source scientific and technical writing software developed by Posit (formerly RStudio).

**2.6 Markdown Preview Enhanced**

Markdown Preview Enhanced is an extension that provides you with many useful functionalities such as automatic scroll sync, math typesetting, mermaid, PlantUML, pandoc, PDF export, code chunk, presentation writer, etc. A lot of its ideas are inspired by Markdown Preview Plus and RStudio Markdown.

**3. Jupyter Notebook Overview**

In this course, we will use Quarto instead of Jupyter Notebook or JupyterLab due to its distinctive capabilities. However, students are free to choose their preferred tool. Below are keyboard shortcuts for those who opt to use either Jupyter Notebook or JupyterLab:

- Press 'a' to add a cell above the current cell.

- Press 'b' to create a cell below the current cell.

- Press 'x' to cut the cell.

- Press 'c' to copy the cell.

- Press 'v' to paste the copied content.

- Press 'ESC + M' to switch the cell to Markdown Mode.

- Press 'ESC + Y' to switch the cell to Code Mode.

- Press 'TAB' to auto-complete what you're typing.

- Press 'ESC + I, I' to interrupt the kernel.

- Press 'ESC + 0, 0' to restart the kernel.

- Press 'Ctrl + Return' to execute the cell.

- Press 'Shift + Return' to execute the cell and add a cell below it.

- Press 'Shift + Tab' to bring up the tooltip, and press 'ESC' to dismiss it.

## 1.2 Fundamentals of Python

In this section, students will acquire foundational knowledge of basic Python data structures and essential tools crucial for their journey in data science. Specifically, this section will cover key aspects, including control flow methods, data structures, and functions.

### Control Flow Methods

Control flow encompasses programming syntax that directs program execution. It lets a program adapt its behavior, and therefore its output, to its current state or input (Farrell et al., 2020).

#### if Statements

Conditionals, often implemented as if statements, are a prevalent form of control flow. They test whether a given condition is true or false. An if statement is expressed as follows in Python:


In [37]:
print("""if [condition to check]:
    do something
    """
)

if [condition to check]:
    do something
    


**Example 1**: Let's determine whether the given number is even or odd.

In [87]:
# Check if a number is even or odd:
for x in range(0, 10):
    if x % 2 == 0:
        print(f'{x = } is even.')
    else:
        print(f'{x = } is odd.')

x = 0 is even.
x = 1 is odd.
x = 2 is even.
x = 3 is odd.
x = 4 is even.
x = 5 is odd.
x = 6 is even.
x = 7 is odd.
x = 8 is even.
x = 9 is odd.


In [95]:
y = 7
if y < 8:
    print(f"{y = }")
else:
    print("y is not less than 8.")

y = 7


**Example 2**: Let's determine whether the given number is divisible by 3, 4, or 6.

***Note:*** Because x % 3 == 0 is true, the first branch runs and the remaining elif and else branches are skipped, even though 150 is also divisible by 6.

In [96]:
x = 150

if x % 3 == 0:
    print(f'x is divisible by 3')
elif x % 4 == 0:
    print(f'x is divisible by 4')
elif x % 6 == 0:
    print(f'x is divisible by 6')
else:
    print(f'x is not divisible by 3, 4, or 6')

x is divisible by 3
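
Because an if/elif chain stops at the first true branch, the output above never reports that 150 is also divisible by 6. The following is a minimal sketch of one way to report every matching divisor by checking each candidate independently; the name `divisors_found` is illustrative and not part of the course material.

```python
x = 150

# Check each candidate divisor independently instead of chaining elif branches.
divisors_found = []
for d in (3, 4, 6):
    if x % d == 0:
        divisors_found.append(d)

print(f'{x = } is divisible by: {divisors_found}')
```

Here every condition is evaluated, so both 3 and 6 are collected.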


In [99]:
x = 15

if x % 2 == 0:
    print(f'{x = } is divisible by 2')
elif x % 3 == 0:
    print(f'{x = } is divisible by 3')
elif x % 6 == 0:
    print(f'{x = } is divisible by 6')
else:
    print(f'{x = } is not divisible by 2, 3, or 6')

x = 15 is divisible by 3


In [117]:
x = 240
if x % 3 == 0 and x % 4 == 0 and x % 6 == 0:
    print(f"{x = } is divisible by 3, 4, and 6. Hooray!!")
else:
    print(f"{x = } is not divisible by all of these numbers.")

x = 240 is divisible by 3, 4, and 6. Hooray!!


#### Loops

Loops are another example of a control flow method used extensively. There are two types of loops: while loops and for loops. The following examples illustrate the usage of each.

**Example 3**: A while loop evaluates a condition before each iteration and keeps iterating as long as that condition remains true.

In [118]:
# Using a while loop:
y = 0
while y < 10:
    print(y)
    y += 1

0
1
2
3
4
5
6
7
8
9


**Example 4**: A for loop is employed to iterate through a given sequence of values.

In [119]:
# Using a for loop:
for i in range(10):
    if i % 2 == 0:
        print(i)

0
2
4
6
8


In [121]:
for n in range(5):
    if n % 3 == 0:
        print(n**2)

0
9


**Example 5**: Check whether Jok Gai and Alier Pach are in the list.

In [125]:
my_list = ['alier reng', 'jok gai', 'deng mach', 'alier pach']

for name in my_list:
    if name == 'jok gai' or name == 'alier pach':
        print(f'Hello, {name.title()}! Welcome to Python for Data Science.')

Hello, Jok Gai! Welcome to Python for Data Science.
Hello, Alier Pach! Welcome to Python for Data Science.



### Data Structures

Data structures are specialized formats for organizing and storing data to facilitate efficient operations like insertion, retrieval, and deletion. They define how data is arranged, stored, and manipulated within a computer's memory. The choice of a data structure depends on the specific requirements of the task, considering factors such as the type of data, the operations to be performed, and the efficiency of those operations.

Common data structures include strings, arrays, lists, sets, dictionaries, and tuples. Each has unique characteristics and advantages, making them suitable for different computer science, programming, and data management scenarios. Selecting an appropriate data structure is crucial for optimizing algorithms and improving the overall performance of software systems.

#### Strings

Strings are immutable sequences of characters that offer many built-in methods for manipulation. Python supports three string formatting styles: %-formatting, str.format(), and f-strings. This course focuses on f-strings, although a %-formatting example is included for comparison.
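
To make the idea of immutability concrete, here is a small sketch (the variable names are illustrative). String methods such as upper() and replace() return new strings rather than changing the original, and assigning to a character position raises a TypeError.

```python
greeting = "hello, friends!"

# String methods return new strings; the original is left unchanged.
shouted = greeting.upper()
replaced = greeting.replace("friends", "learners")
print(shouted)
print(replaced)
print(greeting)

# Assigning to a character position fails because strings are immutable.
try:
    greeting[0] = "H"
except TypeError as err:
    print(f"TypeError: {err}")
```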

**Example 6**: f-strings

In [47]:
a = "Hello, friends!"
b = " Welcome to the Data Science Community!"

print(f'{a + b}')

Hello, friends! Welcome to the Data Science Community!


In [126]:
# %-formatting example:
name = "Akuien"
age = 3
height = 3.2

# Using %-formatting
formatted_string = "My name is %s! I'm %d years old, and I'm %.2f feet tall." % (name, age, height)

print(formatted_string)


My name is Akuien! I'm 3 years old, and I'm 3.20 feet tall.


In [128]:
my_string = "Alierwai DataStudio"
for char in my_string:
    if char.isupper():
        print(f'{char} is a capital letter!')

A is a capital letter!
D is a capital letter!
S is a capital letter!


In [50]:
name = 'akuien'
age = 3
print(f'{name.title()} is {age} years old! He is now a big boy.')

Akuien is 3 years old! He is now a big boy.


**Example 6 (continued)**: Numeric formatting options for strings

In [131]:
from math import pi, sin, cos
print(f'pi = {pi:.4f}, rounded to four decimal places.')

# Add a space between these outputs
print(' ')

print(f'sin(pi) = {sin(pi):.2f} and cos(pi) = {cos(pi):.2f}')

pi = 3.1416, rounded to four decimal places.
 
sin(pi) = 0.00 and cos(pi) = -1.00


In [132]:
# Displaying the current time:
from datetime import datetime as dt
print(f'The current time is {dt.now():%H:%M:%S}, EDT.')

The current time is 13:09:23, EDT.


In [137]:
# Using the f-strings formatting:
my_books = [(1, "Kamusi Ya Kiswahili Sanifu", "43.98"), (3, "R 4 Data Science", "56.87"), (10, "Effective Pandas", "34.09")]
for i, title, cost in my_books:
    # Right-align each title in a field as wide as the longest title (26 characters)
    print(f"{i:02}. {title:>26} is ${cost} on amazon.com.")

01. Kamusi Ya Kiswahili Sanifu is $43.98 on amazon.com.
03.           R 4 Data Science is $56.87 on amazon.com.
10.           Effective Pandas is $34.09 on amazon.com.


In [142]:
# Using the f-strings formatting:
my_books = [(1, "Kamusi Ya Kiswahili Sanifu", "43.98"), (3, "R 4 Data Science", "56.87"), (10, "Effective Pandas", "34.09")]
for i, title, cost in my_books:
    print(f"{i:02}. {title:.<26} is ${cost} on amazon.com.")

01. Kamusi Ya Kiswahili Sanifu is $43.98 on amazon.com.
03. R 4 Data Science.......... is $56.87 on amazon.com.
10. Effective Pandas.......... is $34.09 on amazon.com.


#### Lists

Python lists stand out as the most commonly used data structure, offering the flexibility to contain a mix of numbers, strings, and tuples, e.g., `my_list = [1, 2, 'a', 'x', (1, 2, 3)]`.

**Example 7**: When working with Python lists, elements can be added in two ways: append() inserts a single element at the end of a list, while list concatenation combines two lists.

In [61]:
my_list = [1, 2, 'a', 'x', (1, 2, 3)]

# Add 4 to my_list
my_list.append(4)
print(f'My updated list is {my_list}')

# Create a new list
b = [5, 12, -8]

print(f'My concatenated list is {my_list + b}')

My updated list is [1, 2, 'a', 'x', (1, 2, 3), 4]
My concatenated list is [1, 2, 'a', 'x', (1, 2, 3), 4, 5, 12, -8]
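
One difference between the two approaches is worth spelling out: append() changes the list in place, while concatenation with + builds a new list and leaves the original untouched. A short sketch with illustrative variable names:

```python
nums = [1, 2, 3]
extra = [4, 5]

# Concatenation builds a new list; nums itself is not modified.
combined = nums + extra
print(nums)
print(combined)

# append() modifies the list in place and returns None.
result = nums.append(4)
print(nums)
print(result)
```

This is why the concatenated list above had to be printed directly: unless the result of `my_list + b` is assigned to a variable, it is discarded.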


You can remove an element from a list with the pop() method using the index of the element being removed.

In [62]:
# Remove -8 from b above
b.pop(2)

-8

Now, let's create a new list from an existing one using a list comprehension.

In [63]:
c = [4, 3, 8, 7, 10, 12]

d = [3 * i for i in c if i % 2 == 1]
print(d)

# Square even elements in list c
square_even_nums = [i**2 for i in c if i % 2 == 0]
print(f'square_even_nums are: {square_even_nums}')

[9, 21]
square_even_nums are: [16, 64, 100, 144]


**Example 8**: Working with multi-dimensional lists

In [64]:
multi_list = [[0, 2, 5], [8, 6, 4], [7, 12, 15]]

# Let's print a row: here, a sublist is assigned to a variable called row:
for row in multi_list:
    print(row)

[0, 2, 5]
[8, 6, 4]
[7, 12, 15]


In [67]:
# print out the first element in each sublist:
for row in multi_list:
    print(row[0])

0
8
7


In [68]:
# Print the value of every cell in multi_list using a nested for loop:
for row in multi_list:
    for element in row:
        print(element)

0
2
5
8
6
4
7
12
15


Next, let's print the diagonal elements of multi_list. A single indexing variable is enough because the row index and the column index of a diagonal element in a table/matrix are equal.

In [69]:
# Print the diagonal elements using an indexing variable i that loops from 0 to 2:
for i in range(3):
    print(multi_list[i][i])

0
6
15


Next, let's format the output above with f-strings.

In [64]:
# Print the diagonal elements in a nicely formatted message, using an indexing variable i that loops from 0 to 2:
for i in range(3):
    print(f'{i + 1}-th diagonal element is: {multi_list[i][i]}')

1-th diagonal element is: 0
2-th diagonal element is: 6
3-th diagonal element is: 15


In [70]:
# Same as above, but zero-pad the position to two digits with the 02d format specifier:
for i in range(3):
    print(f'{i + 1:02d}-th diagonal element is: {multi_list[i][i]}')

01-th diagonal element is: 0
02-th diagonal element is: 6
03-th diagonal element is: 15


#### Tuples

Tuples are similar to lists; however, they are immutable objects in Python, which means they can't be changed after creation.

**Example 9**:

In [71]:
# Try changing a tuple:
a = (0, 1, 2, 3)
a[2] = 5  # raises a TypeError: 'tuple' object does not support item assignment

TypeError: 'tuple' object does not support item assignment

In [72]:
# Try appending 7 to a
a.append(7)

AttributeError: 'tuple' object has no attribute 'append'

In [73]:
# Tuple with integers:
int_tuple = (1, 2, 3)
print(int_tuple)

# tuple with mixed datatypes:
my_tuple = (1, "Hello", 3.4)
print(my_tuple)

# nested tuple:
nested_tuple = ("mouse", [8, 4, 6], (1, 2, 3))
print(nested_tuple)

(1, 2, 3)
(1, 'Hello', 3.4)
('mouse', [8, 4, 6], (1, 2, 3))
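
One subtlety of the nested tuple above: immutability applies to the tuple's own slots, not to the objects stored in them. The following sketch shows that a list held inside a tuple can still be modified.

```python
nested_tuple = ("mouse", [8, 4, 6], (1, 2, 3))

# The tuple's slots cannot be reassigned, but the list stored in slot 1
# is still an ordinary mutable list.
nested_tuple[1].append(10)
print(nested_tuple)
```

The tuple still holds the same three objects; only the contents of the inner list changed.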


#### Sets

A Python set is an unordered collection of unique elements. A set is initialized with curly brackets, and we can add elements to it with the add() method.

***Note:*** Because a set is a collection, it can be iterated over with a for loop; however, the elements may not appear in the order in which they were added. Additionally, adding an element that already exists in the set has no effect.

In [74]:
# Create a set: 
a = {4, 5, 7}

# Add 9 to a:
a.add(9)
print(a)

# Add 4 to a:
a.add(4)
print(f'This is our updated set a after adding 4 to it: {a}')

{9, 4, 5, 7}
This is our updated set a after adding 4 to it: {9, 4, 5, 7}
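
Because sets keep only one copy of each element, a common use is removing duplicates from a list. A minimal sketch, using made-up sample names; sorted() is applied only to give the result a predictable order.

```python
names = ['akuien', 'kiir', 'akuien', 'garang', 'kiir']

# Converting to a set drops duplicates; sorted() returns a list in a predictable order.
unique_names = sorted(set(names))
print(unique_names)

# An empty set must be created with set(), because {} creates an empty dictionary.
empty_set = set()
print(type(empty_set), type({}))
```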


#### Set Operations

Set operations can be performed in Python with the union() and intersection() methods.

**Example 10**:

In [75]:
#Create set b:
b = {3, 7, 6}

# Perform a union:
print(f'The union of sets a & b is {a.union(b)}')

# Find the intersection of sets a & b:
print(f'The intersection of sets a & b is {a.intersection(b)}')

The union of sets a & b is {3, 4, 5, 6, 7, 9}
The intersection of sets a & b is {7}


In [76]:
# Remove an element from a set: use discard() or remove():
a = {4, 5, 7}
a.remove(4)
print(a)

print(' ')

b = {3, 7, 6}
b.discard(7)
print(b)

{5, 7}
 
{3, 6}
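
Beyond union() and intersection(), Python sets support a few related operations. The sketch below uses illustrative set names and wraps results in sorted() only so that the printed order is predictable. It also shows the practical difference between discard() and remove() when the element is missing.

```python
a_set = {4, 5, 7, 9}
b_set = {3, 6, 7}

# The | and & operators are equivalent to union() and intersection().
print(sorted(a_set | b_set))
print(sorted(a_set & b_set))

# difference(): elements in a_set that are not in b_set.
print(sorted(a_set.difference(b_set)))

# symmetric_difference(): elements that appear in exactly one of the two sets.
print(sorted(a_set.symmetric_difference(b_set)))

# discard() silently ignores a missing element, whereas remove() raises a KeyError.
b_set.discard(100)
try:
    b_set.remove(100)
except KeyError as err:
    print(f"KeyError: {err}")
```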


#### Dictionaries
Python dictionaries are collections of key-value pairs that provide fast lookups and retrievals based on keys. Since Python 3.7, dictionaries preserve the order in which keys were inserted.

In [77]:
# Let's create a dictionary containing students' names mapped to their course grades
student_grades = {'akuien':97, 
                  'kiir':96, 
                  'garang':100,
                  'ayen':89
                  }

In the above example, the students' names represent the keys, and the grades represent the values.

In [78]:
student_grades['akuien']

97

In [79]:
# Example
print(f"Akuien scored {student_grades['akuien']}% in this test.")

Akuien scored 97% in this test.
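
Looking up a key that is not in the dictionary raises a KeyError. A small sketch of two common ways to handle that, using a separate illustrative dictionary named `grades` and a made-up missing key:

```python
grades = {'akuien': 97, 'kiir': 96, 'garang': 100, 'ayen': 89}

# Indexing with a missing key raises a KeyError.
try:
    grades['deng']
except KeyError as err:
    print(f"KeyError: {err}")

# get() returns a default value instead of raising an error.
grade = grades.get('deng', 'no grade recorded')
print(grade)

# The in operator checks whether a key exists.
print('kiir' in grades)
```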


As shown below, we can update a dictionary's existing values and add new key-value pairs.

In [80]:
# Change a dictionary's existing value
student_grades["ayen"] = 100

print(f"Ayen scored {student_grades['ayen']}% in this test.")

Ayen scored 100% in this test.


In [81]:
# Adding a new key and value to a dictionary
student_grades["atoch"] = 99

In [82]:
# Print the updated dictionary
print(student_grades)

{'akuien': 97, 'kiir': 96, 'garang': 100, 'ayen': 100, 'atoch': 99}


In [83]:
# A dictionary can also be created with a dictionary comprehension.
squared_nums = {i: i**2 for i in range(-5, 5)}

squared_nums

{-5: 25, -4: 16, -3: 9, -2: 4, -1: 1, 0: 0, 1: 1, 2: 4, 3: 9, 4: 16}

It is noteworthy that dictionary keys must be hashable, which in practice means built-in immutable objects such as strings, numbers, and tuples of immutable values. Tuples can therefore be used as dictionary keys, but lists can't.
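
A short sketch of this rule in action; the coordinates and labels are made up for illustration.

```python
# Tuples of immutable values are hashable, so they work as dictionary keys.
locations = {(0, 0): 'origin', (1, 2): 'point A'}
print(locations[(1, 2)])

# A list is mutable and therefore unhashable; using one as a key raises a TypeError.
try:
    locations[[3, 4]] = 'point B'
except TypeError as err:
    print(f"TypeError: {err}")
```

The failed assignment leaves the dictionary unchanged, so it still contains only the two tuple keys.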

In [84]:
# Deleting a dictionary key-value:
del student_grades['ayen']

print(student_grades)

{'akuien': 97, 'kiir': 96, 'garang': 100, 'atoch': 99}


#### for loops

In [85]:
# Let's loop through the student_grades:
for k, v in student_grades.items():
    print(k, v)

akuien 97
kiir 96
garang 100
atoch 99


In [100]:
# Let's make this pretty:
for k, v in student_grades.items():
    print(f"{k.title():.<15} {v}")

Akuien......... 97
Kiir........... 96
Garang......... 100
Atoch.......... 99


In [103]:
# Let's make this pretty:
for k, v in student_grades.items():
    print(f"{k.title():.>10} {v}")

....Akuien 97
......Kiir 96
....Garang 100
.....Atoch 99


We can pair two or more sequences with the `zip()` function and loop over them together.

In [109]:
# Create two sequences
questions = ['name', 'interest', 'favorite programming language']
responses = ['akuien', 'data science & computer science', 'python']
for q, r in zip(questions, responses):
    print('What is your {0}?  It is {1}.'.format(q, r))
    


What is your name?  It is akuien.
What is your interest?  It is data science & computer science.
What is your favorite programming language?  It is python.


In [112]:
# Let's repeat the above example using f-strings
for q, r in zip(questions, responses):
    print(f'What is your {q}?  My {q} is {r.title()}.')

What is your name?  My name is Akuien.
What is your interest?  My interest is Data Science & Computer Science.
What is your favorite programming language?  My favorite programming language is Python.
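
Two behaviors of zip() are easy to miss: it stops at the end of the shortest sequence, and its pairs can be passed straight to dict(). A minimal sketch with illustrative variable names:

```python
fields = ['name', 'interest', 'favorite programming language']
answers = ['akuien', 'data science']

# zip() stops as soon as the shortest sequence is exhausted,
# so the third field is silently dropped here.
pairs = list(zip(fields, answers))
print(pairs)

# Pairing keys with values is a common way to build a dictionary.
profile = dict(zip(fields, answers))
print(profile)
```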


## Functions
A function is essentially an object capable of receiving an input and generating an output based on a specific set of instructions. A `Python` function is defined as follows:

In [96]:
print(f"""
def func_name(arg_1, arg_2, ...):

   do something here!

   return [output]
""")


def func_name(arg_1, arg_2, ...):

   do something here!

   return [output]



### Examples of Python Functions

In [2]:
# Define a function for extracting first names 
def welcome_learners(full_name):
    names = full_name.split()
    
    if len(names) >= 2:
        first_name = names[0]
        return f"Hello, {first_name.title()}! Welcome to Alierwai DataStudio!!"
    else:
        return "Please provide at least two names."



In [3]:
# Example usage:
greeting = welcome_learners('alier pach')
print(greeting)

Hello, Alier! Welcome to Alierwai DataStudio!!


In [4]:
# Squaring a number
def square_num(num):
    return num**2

square_num(5)
    

25

In [5]:
# Write a function for calculating a mean
def calc_mean(x) -> float:
    # Divide the total by the number of values
    avg = sum(x) / len(x)

    return avg

In [6]:
x = [1, 2, 3, 4, 5]
calc_mean(x)

3.0
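
The mean calculation above fails with a ZeroDivisionError when given an empty list. As one possible way to make such a function more robust, the sketch below validates its input first; the name `safe_mean` and the error message are hypothetical choices, not part of the course material.

```python
def safe_mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    # Guard against an empty input, which would otherwise cause a ZeroDivisionError.
    if len(values) == 0:
        raise ValueError("safe_mean() requires at least one value.")
    return sum(values) / len(values)


print(safe_mean([1, 2, 3, 4, 5]))

# An empty list now produces a clear, descriptive error.
try:
    safe_mean([])
except ValueError as err:
    print(f"ValueError: {err}")
```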

##  Miscellaneous

### Mathematical Operations 
Below are Python's arithmetic operators:

1. Addition: +
2. Subtraction: -
3. Multiplication: *
4. Division: /
5. Floor (integer) division: //
6. Modulo (remainder): %
7. Exponentiation: **

In [104]:
# Using Python mathematical operations
# Addition: +
x = 9
y = 7
 
z = x + y
 
print(f'{z = }')

# Subtraction: -

print(f'{x - y = }')

# Multiplication: *
print(f'{x * y = }')

# Division: /
print(f'{x / y = }')

# Integer division: //
print(f'{x // y = }')

# Exponents: **
print(f'{x**y = }')

# Modulo: %
print(f'{x % y = }')

z = 16
x - y = 2
x * y = 63
x / y = 1.2857142857142858
x // y = 1
x**y = 4782969
x % y = 2
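
Floor division and modulo are two halves of the same calculation: the quotient times the divisor, plus the remainder, gives back the original number. The sketch below checks this for the values used above and shows how both operators behave with a negative operand.

```python
x = 9
y = 7

# Floor division and modulo together reconstruct the original number.
quotient = x // y
remainder = x % y
print(f'{quotient * y + remainder = }')

# divmod() returns the quotient and the remainder at once as a tuple.
print(f'{divmod(x, y) = }')

# With a negative operand, // rounds toward negative infinity,
# and % returns a result with the same sign as the divisor.
print(f'{-9 // 7 = }')
print(f'{-9 % 7 = }')
```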


## Practice Exercises

Let's go ahead and apply the knowledge you've gained in the preceding sections through these practice exercises. Challenge yourself and reinforce your understanding by solving the following problems.

**Exercise 1:**

i) Create the list of the first several Fibonacci numbers:

1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89.
 
ii) Print the first four elements of the list.

iii) Print every third element of the list starting from the first.

iv) Print the last element of the list.

v) Print the list in reverse order.

vi) Print the list starting at the last element and counting backward by every other element.

**Exercise 2: for loop**

We want to sum the first 100 perfect cubes. Let's do this in two ways.

a) Start off a variable called Total at 0 and write a for loop that adds the next perfect cube to the running total.

b) Write a for loop that builds a list of the first 100 perfect cubes. After the list has been built, find the sum with the sum() function.

**Exercise 3: while loop**

Write a while loop that sums the terms in the Fibonacci sequence until the sum is larger than 1000.


# References

1. Harrison, M. (2021). Effective Pandas: Patterns for data manipulation.

2. Farrell, P., Fuentes, A., Kolhe, A. S., Nguyen, Q., Sarver, A. J., Tsatsos, M. (2020). The Statistics and Calculus with Python Workshop: A comprehensive introduction to mathematics in Python for artificial intelligence applications. Packt Publishing. Kindle Edition.
3. Python Documentation: https://docs.python.org/3/tutorial/datastructures.html
4. Python f-string tips & cheatsheets by **Trey Hunner**: https://www.pythonmorsels.com/string-formatting/?ck_subscriber_id=1493298043
5. Sullivan, E. (2022). *Numerical Methods: An Inquiry-Based Approach With Python.* Retrieved from: https://numericalmethodssullivan.github.io/index.html#resources