# HW00: CODING ENVIRONMENT SETUP AND INTRODUCTION TO PYTHON

This is Assignment 00 for the course "Introduction to Data Science" at the Faculty of Information Technology, University of Science, Vietnam National University, Ho Chi Minh City.

(Latest update: 09/08/2024)

Student Name:

Student ID:

---

## **Assignment Objectives**

In this assignment, we will introduce you to setting up a Python environment with Anaconda and familiarize you with the Python programming language. Additionally, we will introduce some basic data types that you will frequently encounter in the future, including digital images and text.

## **How to Complete and Submit the Assignment**

&#9889; **Note**: You should follow the instructions below. If anything is unclear, contact the teaching assistant or instructor promptly so that you can get timely support.

**How to Do the Assignment**

You will work directly on this notebook file. First, fill in your full name and student ID (MSSV) in the header section of the file above. In the file, complete the tasks in sections marked:
```python
# YOUR CODE HERE
raise NotImplementedError()
```
Or for optional code sections:
```python
# YOUR CODE HERE (OPTION)
```
For markdown cells, complete the answer in the section marked:
```markdown
YOUR ANSWER HERE
```

**How to Submit the Assignment**

Before submitting, select `Kernel` -> `Restart Kernel & Run All Cells` if you are using a local environment, or `Runtime` -> `Restart session and run all` if you are using Google Colab, to ensure everything works as expected.

Next, create a submission folder with the following structure:
- Folder named `MSSV` (for example, if your student ID is `1234567`, name the folder `1234567`)
    - File `HW00.ipynb` (no need to submit other files)

Finally, compress this `MSSV` folder in `.zip` format (not `.rar` or any other format) and submit it via the link on Moodle.\
<font color=red>Please make sure to strictly follow this submission guideline.</font>

## Imports

In [1]:
import platform
import os
import sys

import numpy as np
import matplotlib.pyplot as plt

In [2]:
from scipy.ndimage import gaussian_filter
from skimage.data import camera

In [3]:
import re

## Helper functions

In [4]:
def checkingPlatform() -> dict:
    """
    Retrieves platform-related information for the current operating system environment.

    Returns:
        dict: A dictionary containing the following platform information:
            - 'platform_system': The system/OS name, e.g., 'Linux', 'Windows', etc.
            - 'os_name': The name of the operating system dependent module imported, e.g., 'posix', 'nt', etc.
            - 'sys_platform': A string representing the platform the script is running on, e.g., 'linux', 'win32', etc.
            - 'platform_release': The release version of the operating system.
            - 'platform_version': The detailed version of the operating system.
            - 'platform_description': A string that combines various platform details.
    """
    return {
        'platform_system': platform.system(),
        'os_name': os.name,
        'sys_platform': sys.platform,
        'platform_release': platform.release(),
        'platform_version': platform.version(),
        'platform_description': platform.platform()
    }

## Check your coding environment

This section checks your programming environment. Any operating system is acceptable (Windows 10/11 or any Linux distribution); set up whichever environment works best for you.

In [None]:
checkingPlatform()
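
Beyond the operating system, it can help to confirm that the libraries imported at the top of this notebook are actually installed. The following is a minimal sketch, not part of the assignment: the helper name `check_packages` and the list `REQUIRED_PACKAGES` are hypothetical, and the distribution names are assumed from the imports used in this notebook.

```python
import sys
from importlib import metadata

# Assumed distribution names, inferred from the imports in this notebook.
REQUIRED_PACKAGES = ["numpy", "matplotlib", "scipy", "scikit-image", "pandas"]


def check_packages(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


print("Python:", sys.version.split()[0])
for name, version in check_packages(REQUIRED_PACKAGES).items():
    print(f"{name}: {version if version is not None else 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` would need to be added to your environment before the later cells can run.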

## Question 1: Based on your knowledge of Python libraries for Data Science, give a brief introduction to one of them.

In this question, you're expected to select a popular Python library used in Data Science and provide a concise introduction. For example, you might choose to introduce a library like NumPy. Your response should follow this outline:

1. Library Introduction:
- Start by identifying the library you’ve chosen. Provide a brief history or background about the library, including its purpose and when it was developed.
- Discuss its popularity and significance in the Data Science ecosystem.
2. Role in Data Science:
- Explain how this library supports common Data Science tasks. Highlight the specific types of operations or challenges that it simplifies (e.g., data manipulation, statistical analysis, numerical computing, etc.).
- Mention key features or advantages that make this library indispensable for Data Scientists.
3. Function Usage and Examples:
- Demonstrate the library's usage with one or two small code examples. Focus on core functionality that shows its relevance in real-world applications (e.g., matrix manipulation, statistical calculations, or data visualization).
- Explain the code snippets and provide context on how these functions can be applied to typical Data Science problems.

For example, NumPy plays a crucial role in Data Science by enabling efficient storage and manipulation of numerical data. It simplifies tasks such as:
- Array operations: NumPy allows for fast element-wise operations on large datasets, which is essential for data analysis and preprocessing.
- Mathematical functions: It offers a wide range of mathematical functions, such as linear algebra operations, statistical functions, and Fourier transforms.
- Interoperability: NumPy integrates seamlessly with other libraries like Pandas, Matplotlib, and SciPy, enhancing its utility in the Data Science workflow.

Example:

```python
import numpy as np

array = np.array([10, 20, 30, 40])
print("Original Array:", array)

mean_value = np.mean(array)
std_dev = np.std(array)
print("Mean:", mean_value)
print("Standard Deviation:", std_dev)
```
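
The element-wise operations and linear algebra functions listed above can be sketched in the same way. The values below are made up for illustration only.

```python
import numpy as np

# Hypothetical prices and quantities for three items.
prices = np.array([15000.0, 12000.0, 18000.0])
quantities = np.array([10, 15, 8])

# Element-wise multiplication: one revenue value per item, no explicit loop.
revenue = prices * quantities
print("Revenue per item:", revenue)

# Broadcasting: the scalar 0.9 is applied to every element.
discounted = prices * 0.9
print("Discounted prices:", discounted)

# Linear algebra: solve the 2x2 system A @ x = b.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([5.0, 10.0])
x = np.linalg.solve(A, b)
print("Solution of A @ x = b:", x)
```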

### Introduction to the Pandas library (`pandas`)

#### What is Pandas?

Pandas is a Python library designed for working with structured data (e.g., tabular data such as Excel files, databases, and multidimensional data).

Pandas provides powerful, flexible data structures along with a rich set of functions for manipulating, cleaning, analyzing, and visualizing data.

*Fun fact: the name "Pandas" is derived from the terms "Panel Data" and "Python Data Analysis". The library was developed by Wes McKinney in 2008.*

#### The role of Pandas in Data Science

**Data manipulation:** Pandas has functions that make it easy to read data from many different sources (CSV, Excel, SQL, ...) and to clean data by handling missing and duplicate values.

**Data analysis:** It provides functions for computing statistics, filtering, grouping, sorting, and calculating key metrics on the data.

**Preparing data for models:** Pandas can be used to prepare data for machine learning models, for example by encoding categorical variables and splitting the data into sets (train, validation, test).

#### Usage and examples

In [None]:
import pandas as pd

# Create a DataFrame from a dictionary
# (column names: 'Quả' = fruit, 'Số lượng' = quantity, 'Giá' = price)
data = {'Quả': ['Táo', 'Chuối', 'Cam'],
        'Số lượng': [10, 15, 8],
        'Giá': [15000, 12000, 18000]}
df = pd.DataFrame(data)

# Display the DataFrame
print(df)

# Access a column
print(df['Quả'])

# Filter rows
print(df[df['Số lượng'] > 10])

# Compute a summary statistic
print(df['Giá'].mean())
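
The three roles described above (cleaning, analysis, and model preparation) can be sketched on a small table. This is only an illustration: the data is hypothetical, and it includes one duplicated row and one missing value so that the cleaning step has something to do.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one duplicated row and one missing quantity.
raw = pd.DataFrame({
    "fruit": ["apple", "banana", "banana", "orange", "apple"],
    "category": ["pome", "berry", "berry", "citrus", "pome"],
    "quantity": [10, 15, 15, 8, np.nan],
})

# Cleaning: drop exact duplicate rows, then fill the missing quantity with 0.
clean = raw.drop_duplicates().fillna({"quantity": 0})

# Analysis: total quantity per category.
totals = clean.groupby("category")["quantity"].sum()
print(totals)

# Model preparation: one-hot encode the categorical column.
encoded = pd.get_dummies(clean, columns=["category"])
print(encoded.columns.tolist())
```

Filling a missing value with 0 is just one possible policy; depending on the problem, dropping the row or using the column mean may be more appropriate.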

### Introduction to Digital Image Processing 

#### Read digital image

In [None]:
def build_gaussian_pyramid(ima, levelmax):
    """Return a list of subsampled images (using a Gaussian pre-filter)."""
    r = [ima]
    current = ima
    for level in range(levelmax):
        lp = gaussian_filter(current, 1.0)
        sub = lp[::2, ::2]
        current = sub
        r.append(current)
    return r


def build_pyramid(ima, levelmax):
    """Return a list of subsampled images (plain subsampling, no pre-filter)."""
    r = [ima]
    current = ima
    for level in range(levelmax):
        sub = current[::2, ::2]
        current = sub
        r.append(current)
    return r


im = camera()[::2, ::2]

# build filtered and non-filtered pyramids
N = 4
fpyramid = build_gaussian_pyramid(im, N)
nfpyramid = build_pyramid(im, N)

for f, nf in zip(fpyramid, nfpyramid):

    plt.figure(figsize=[7, 7])
    plt.subplot(1, 2, 1)
    plt.imshow(f, cmap=plt.cm.gray, interpolation="nearest")
    plt.title("Gaussian pyramid")
    plt.subplot(1, 2, 2)
    plt.imshow(nf, cmap=plt.cm.gray, interpolation="nearest")
    plt.title("subsampling pyramid")
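
Why filter before subsampling? A rough 1D illustration, under simplifying assumptions: the signal alternates between +1 and -1, and a small binomial kernel with periodic boundaries stands in for the Gaussian filter used above. Plain subsampling turns the alternating signal into a constant one (aliasing), while filtering first removes the high frequency before it can alias.

```python
import numpy as np

# Highest-frequency 1D signal: alternates +1, -1, +1, -1, ...
signal = np.array([1.0, -1.0] * 8)

# Plain subsampling keeps every other sample: the alternation vanishes
# and the result looks like a constant signal (aliasing).
plain = signal[::2]

# Pre-filter with a small binomial kernel (a rough stand-in for a Gaussian),
# applied with periodic boundaries via np.roll, then subsample.
kernel = np.array([0.25, 0.5, 0.25])
smoothed = (kernel[0] * np.roll(signal, 1)
            + kernel[1] * signal
            + kernel[2] * np.roll(signal, -1))
filtered = smoothed[::2]

print("plain subsampling:   ", plain)
print("filtered subsampling:", filtered)
```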

#### Image Modification

In [7]:
def imageplot(f, title='', sbpt=None):
    """
        Display an image in grayscale using nearest neighbor interpolation.
        If given, sbpt is (nrows, ncols, index) and selects the subplot.
    """
    if sbpt:
        plt.subplot(sbpt[0], sbpt[1], sbpt[2])
    imgplot = plt.imshow(f, interpolation='nearest')
    imgplot.set_cmap('gray')
    if title != '':
        plt.title(title)

In [None]:
imageplot(-im, '-M', [1,2,1])
imageplot(im[::-1,:], 'Flipped', [1,2,2])

Blurring is achieved by computing a convolution with a kernel.

Compute the low-pass Gaussian kernel. Warning: the indices must be taken modulo n in order to use FFTs.

In [None]:
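sigma = 7
n = 256
# Offsets in FFT order: 0, 1, ..., n/2 - 1, -n/2, ..., -1
t = np.concatenate((np.arange(0, n // 2), np.arange(-n // 2, 0)))
[Y, X] = np.meshgrid(t, t)
h = np.exp(-(X**2 + Y**2) / (2.0 * float(sigma) ** 2))
# Normalize so the whole kernel sums to 1 (the built-in sum would only sum along axis 0)
h = h / np.sum(h)
imageplot(np.fft.fftshift(h))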

Compute the periodic convolution using FFTs.

In [10]:
Mh = np.real(np.fft.ifft2(np.fft.fft2(im) * np.fft.fft2(h)))
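
To see why multiplying the two spectra amounts to a periodic convolution, here is a small 1D sketch. The function names and the test signal are hypothetical; the kernel is stored "modulo n", so the weight for offset -1 sits at index n - 1, which mirrors the index layout required for `h` above.

```python
import numpy as np


def circular_convolve_direct(f, h):
    """Periodic convolution by definition: g[k] = sum_j f[j] * h[(k - j) mod n]."""
    n = len(f)
    g = np.zeros(n)
    for k in range(n):
        for j in range(n):
            g[k] += f[j] * h[(k - j) % n]
    return g


def circular_convolve_fft(f, h):
    """Same result via the convolution theorem: multiply the spectra."""
    return np.real(np.fft.ifft(np.fft.fft(f) * np.fft.fft(h)))


f = np.array([1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0])
# Kernel centered at index 0: offsets 0, +1 and -1 (stored at index n - 1).
h = np.array([0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25])

direct = circular_convolve_direct(f, h)
via_fft = circular_convolve_fft(f, h)
print(direct)
print(np.allclose(direct, via_fft))
```

The double loop costs O(n^2) operations while the FFT route costs O(n log n), which is why the FFT version is preferred for images.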

Display

In [None]:
imageplot(im, "Image", [1, 2, 1])
imageplot(Mh, "Blurred", [1, 2, 2])

Several differential and convolution operators are implemented.

In [12]:
def grad(f):
    """
    Compute a finite difference approximation of the gradient of a 2D image, assuming periodic BC.
    """
    # Work in floating point: subtracting unsigned integers would wrap around
    f = np.asarray(f, dtype=float)
    S = f.shape
    # Index vectors shifted by one position, wrapping around at the border
    s0 = np.concatenate((np.arange(1, S[0]), [0]))
    s1 = np.concatenate((np.arange(1, S[1]), [0]))
    g = np.dstack((f[s0, :] - f, f[:, s1] - f))
    return g

In [None]:
G = grad(im)
imageplot(G[:,:,0], 'd/dx', [1, 2, 1])
imageplot(G[:,:,1], 'd/dy', [1, 2, 2])
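
The periodic forward difference can be checked by hand on a tiny array. The sketch below uses `np.roll` as an equivalent way of writing the shifted indexing; the helper name `periodic_forward_diff` and the 3x3 array are hypothetical.

```python
import numpy as np


def periodic_forward_diff(f, axis):
    """Forward difference f[i + 1] - f[i] along `axis`, wrapping at the border."""
    f = np.asarray(f, dtype=float)  # avoid unsigned-integer wraparound
    return np.roll(f, -1, axis=axis) - f


f = np.array([[1, 2, 4],
              [3, 5, 9],
              [0, 0, 0]], dtype=np.uint8)

d0 = periodic_forward_diff(f, axis=0)  # differences between consecutive rows
d1 = periodic_forward_diff(f, axis=1)  # differences between consecutive columns
print(d0)
print(d1)
```

Because of the periodic boundary, the last row of `d0` is the first row of `f` minus the last one, and the differences along each axis sum to zero.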

#### Fourier Transform
The 2D Fourier transform can be used to perform low pass approximation and interpolation (by zero padding).

Compute and display the Fourier transform (on a log scale). The function fftshift is useful for moving the zero (lowest) frequency to the middle. After fftshift, the zero frequency is located at position (n/2+1, n/2+1) in 1-based indexing, which is index (n/2, n/2) in NumPy's 0-based indexing.

In [None]:
Mf = np.fft.fft2(im)
Lf = np.fft.fftshift(np.log(abs(Mf) + 1e-1))
imageplot(im, 'Image', [1, 2, 1])
imageplot(Lf, 'Fourier transform', [1, 2, 2])
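
The low-pass approximation mentioned above can be sketched as follows: keep only the frequencies near the center of the shifted spectrum and invert the transform. This is a minimal illustration on a synthetic image, not the course's reference solution; the function `low_pass` and its `keep` parameter are hypothetical names.

```python
import numpy as np


def low_pass(image, keep):
    """Zero out every frequency whose index exceeds `keep` in absolute value on either axis."""
    n0, n1 = image.shape
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    c0, c1 = n0 // 2, n1 // 2  # zero-frequency index after fftshift
    mask = np.zeros(image.shape, dtype=bool)
    mask[c0 - keep:c0 + keep + 1, c1 - keep:c1 + keep + 1] = True
    filtered = np.where(mask, spectrum, 0)
    return np.real(np.fft.ifft2(np.fft.ifftshift(filtered)))


n = 32
x = np.arange(n)
X, Y = np.meshgrid(x, x, indexing="ij")
# Synthetic test image: a slow wave (frequency 1) plus a fast one (frequency 10).
slow = np.cos(2 * np.pi * X / n)
fast = 0.5 * np.cos(2 * np.pi * 10 * Y / n)
image = slow + fast

approx = low_pass(image, keep=4)
print(np.allclose(approx, slow))
```

With `keep=4`, the fast component at frequency 10 falls outside the retained block, so only the slow wave survives.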

In [None]:
string = "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."
separators = "; ", ", ", " ", "."

def statistics_in_word(word):
    # Count how many times each character occurs in the word
    char_counts = {}
    for each_char in word:
        if each_char not in char_counts:
            char_counts[each_char] = 1
        else:
            char_counts[each_char] += 1
    return char_counts
            
# Solution for requirement 01
print("Solution for requirement 01")
def tokenizing(text):
    list_tokens = re.split(r'[ ,.]', text)
    # Filter out empty tokens; removing items from a list while iterating over it can skip elements
    return [token for token in list_tokens if token != '']
            
print('Tokenize')
print(tokenizing(string))

# Solution for requirement 02
print("Solution for requirement 02")
def counting(list_tokens):
    print('Statistics')
    for each_token in list_tokens:
        print(each_token + ":")
        print(statistics_in_word(each_token))
        
    
counting(tokenizing(string))

# custom tokenizer
def custom_tokenizer(sepr_list, str_to_split):
    # create regular expression dynamically
    regular_exp = '|'.join(map(re.escape, sepr_list))
    return re.split(regular_exp, str_to_split)

print('Custom tokenizer')
print(custom_tokenizer(separators, string))

### Introduction to NLP 

Problem: Tokenize and count the number of characters in each word.

Tokenize the following sentence: "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."

Provide a list of the number of alphabetic characters in each word in the order they appear in the sentence.
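
One possible way to produce the requested list directly is sketched below. It assumes that a "word" is a maximal run of ASCII letters, which is sufficient for this sentence but ignores digits, apostrophes, and accented letters.

```python
import re

sentence = ("Now I need a drink, alcoholic of course, after the heavy "
            "lectures involving quantum mechanics.")

# Each maximal run of ASCII letters is treated as one word (an assumption).
words = re.findall(r"[A-Za-z]+", sentence)
lengths = [len(word) for word in words]
print(lengths)  # [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]
```

The resulting lengths spell out the first fifteen digits of pi, which is why this sentence is a popular tokenization exercise.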

In [None]:
import re
from collections import defaultdict

string = "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."
separators = [";", ",", " ", "."]

def statistics_in_word(word):
    """Count the occurrences of each character in the word."""
    char_count = defaultdict(int)  # Use defaultdict for cleaner code
    for char in word:
        char_count[char] += 1
    return dict(char_count)

# Solution for requirement 01
print("Solution for requirement 01")

def tokenizing(input_string):
    """Tokenize the input string into words, removing empty tokens."""
    list_tokens = re.split(r"[ ,.]", input_string)
    return [token for token in list_tokens if token]  # List comprehension to filter out empty tokens

print("Tokenize:")
tokens = tokenizing(string)
print(tokens)

# Solution for requirement 02
print("Solution for requirement 02")

def counting(tokens):
    """Print the character statistics for each token."""
    print("Statistics:")
    for token in tokens:
        print(f"{token}: {statistics_in_word(token)}")

counting(tokens)

# Custom tokenizer
def custom_tokenizer(separators, str_to_split):
    """Tokenize the input string based on a list of separators."""
    # Create a regular expression dynamically
    regular_exp = "|".join(map(re.escape, separators))
    return re.split(regular_exp, str_to_split)

print("Custom tokenizer:")
print(custom_tokenizer(separators, string))