
## Programming for Data Science

### Lecture 9: Modules and Packages

### Instructor: Farhad Pourkamali 


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/farhad-pourkamali/CUSucceedProgrammingForDataScience/blob/main/Lecture9_ModulesPackages.ipynb)


### Introduction
<hr style="border:2px solid gray">

* In Python, `functions`, `modules`, and `packages` are essential constructs that facilitate code modularization, promoting code reuse, maintainability, and readability in large and complex projects.

* Functions
    * Functions in Python are blocks of reusable code that perform a specific task.
    * They are defined using the `def` keyword followed by a function name, parameters, and a block of code.

In [1]:
def is_prime(num):
    """
    Check if a given number is prime.
    
    A prime number is a number that has exactly two distinct positive divisors: 1 and itself.

    Parameters:
    - num (int): The input number to check.

    Returns:
    - bool: True if the number is prime, False otherwise.
    """
    
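    # Integers below 2 (0, 1, and negatives) are not prime
    if num < 2:
        return False
    
    # Any composite number has a divisor no larger than its square root
    candidate_factors = range(2, int(num**0.5)+1)
    
    for factor in candidate_factors: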
        if num % factor == 0:
            return False
        
    return True 

In [2]:
print(is_prime(3), is_prime(4), is_prime(9), is_prime(11))

True False False True


In [3]:
def list_primes(num_max):
    """
    Generate a list of prime numbers up to a specified maximum.

    This function iterates through all numbers from 2 up to (but not including) the specified maximum number. 
    
    Parameters:
    - num_max (int): The upper limit (exclusive) for checking prime numbers. 

    Returns:
    - list: A list of prime numbers less than `num_max`.
    
    """
    
    my_list = []
    
    for num in range(2, num_max):
        if is_prime(num):
            my_list.append(num)
            
    return my_list 

In [4]:
list_primes(20)

[2, 3, 5, 7, 11, 13, 17, 19]

### Restart Kernel

Click on the "Kernel" menu above and select "Restart & Clear Output."

* Modules
    * A module is a file containing Python code, including variables, functions, and classes.
    * They allow organizing code into separate files for better maintainability.
    * Modules can be reused across different parts of a project or in other projects.
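    * Create a module named `primes.py` in the same directory as this notebook and copy the `is_prime` and `list_primes` functions defined above into it.
    * When the interpreter executes the `import primes` statement, it searches for `primes.py` in the directories listed in `sys.path`, which typically include the directory the notebook runs from.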
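The search described above can be tried out with a throwaway module. The following is a minimal sketch, not part of the lecture files: the module name `primes_demo` and the temporary directory are assumptions chosen so that the example does not touch your own `primes.py`.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical module name, chosen so it cannot clash with the lecture's primes.py
module_name = "primes_demo"

# Source code of the module: a single function definition
module_source = '''
def is_prime(num):
    """Return True if num is prime, False otherwise."""
    if num < 2:
        return False
    for factor in range(2, int(num ** 0.5) + 1):
        if num % factor == 0:
            return False
    return True
'''

# Write the module into a fresh temporary directory
module_dir = Path(tempfile.mkdtemp())
(module_dir / f"{module_name}.py").write_text(module_source)

# The interpreter only looks in directories listed in sys.path
sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()

primes_demo = importlib.import_module(module_name)

print(primes_demo.__file__)      # location of the file that was imported
print(primes_demo.is_prime(11))  # access through the module prefix
```

Printing `__file__` is a quick way to confirm which file an `import` statement actually picked up when several directories are on the search path.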

In [5]:
import primes

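* Note that `import primes` does not make the module contents directly accessible. In a freshly restarted kernel, calling `is_prime` without the module prefix raises a `NameError`.

In [6]:
print(is_prime(4))

NameError: name 'is_prime' is not defined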


* The module contents are only accessible when prefixed with `primes` via dot notation.

In [7]:
print(primes.is_prime(4))

False


* The `from <module_name> import <name(s)>` syntax in Python is used to import specific names (functions, classes, or variables) directly. This allows you to use those names without referencing the module they belong to.

In [8]:
from primes import is_prime

print(is_prime(4))

False


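* The following syntax imports all public names (functions, classes, and variables) from a module:

```
from module_name import *
```

* This form is generally discouraged outside of quick interactive work, because it hides where each name comes from and can silently overwrite names that are already defined.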

In [9]:
from primes import *

print(is_prime(4), list_primes(10))

False [2, 3, 5, 7]
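To see how a wildcard import can overwrite names, it helps to list the names that two modules would both bring in. The sketch below compares the standard-library modules `math` and `cmath`; the helper `public_names` is a hypothetical function written for this illustration. It mirrors the rule that `import *` uses a module's `__all__` list if one is defined and otherwise takes every name that does not start with an underscore.

```python
import cmath
import math


def public_names(module):
    """Return the set of names that `from module import *` would bring in."""
    names = getattr(module, "__all__", None)
    if names is None:
        # Without __all__, every name not starting with an underscore is imported
        names = [name for name in dir(module) if not name.startswith("_")]
    return set(names)


# Names defined by both modules: whichever wildcard import runs last wins
clashes = public_names(math) & public_names(cmath)

print(sorted(clashes))
```

Because `sqrt` appears in both modules, running `from math import *` followed by `from cmath import *` would leave `sqrt` bound to the complex-number version without any warning.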


### The time module

* The `time` module in Python provides various time-related functions, allowing you to work with time values and perform operations such as measuring time intervals, formatting time, and pausing the execution of a program.

In [10]:
import time

type(time)

module

* The `process_time()` function in the `time` module measures the CPU time, in seconds, consumed by the current process. Time spent sleeping is not included.

* To use `process_time`, call it just before the code you want to measure and store the result in a variable, say `t0`.

* After the code finishes, call `process_time` again and store the result in a second variable, `t1`. 

* The difference `t1 - t0` is the CPU time the code consumed, which you can use to compare the efficiency of different implementations.


In [11]:
# Start measuring process time
start_process_time = time.process_time()

# Code to measure (e.g., some CPU-intensive computation)
for num in range(10**7):
    num_sq = num ** 2

# Stop measuring process time
end_process_time = time.process_time()

# Calculate elapsed process time
elapsed_process_time = end_process_time - start_process_time

print(f"Elapsed Process Time: {elapsed_process_time:0.3f}")

Elapsed Process Time: 1.566
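CPU time is not the same as the time you would read off a wall clock. The short sketch below contrasts `time.process_time()` with `time.perf_counter()`, which measures elapsed real time; the 0.2-second pause is an arbitrary value chosen for illustration.

```python
import time

cpu_start = time.process_time()
wall_start = time.perf_counter()

time.sleep(0.2)  # the process is idle here, so almost no CPU time is used

cpu_elapsed = time.process_time() - cpu_start
wall_elapsed = time.perf_counter() - wall_start

print(f"CPU time:        {cpu_elapsed:0.3f} s")
print(f"Wall-clock time: {wall_elapsed:0.3f} s")
```

The CPU time stays close to zero while the wall-clock time is roughly the length of the pause, so `perf_counter` is usually the better choice when the code being timed waits on files, networks, or sleeps.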


### The math module 

* The `math` module in Python provides mathematical functions and constants for working with real numbers. 

* These functions cover a wide range of operations, including trigonometry, logarithms, exponentiation, rounding, factorials, and more. 

In [12]:
import math 

type(math)

module

In [13]:
import math

# Calculate the square root
sqrt_result = math.sqrt(25)

# Calculate the sine of 30 degrees
sin_result = math.sin(math.radians(30))

# Calculate the factorial of 5
factorial_result = math.factorial(5)

print("Square Root:", sqrt_result)
print("Sine of 30 degrees:", sin_result)
print("Factorial of 5:", factorial_result)


Square Root: 5.0
Sine of 30 degrees: 0.49999999999999994
Factorial of 5: 120
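The sine of 30 degrees is exactly 0.5 in mathematics, but the printed value above is slightly off because floating-point numbers have finite precision. A minimal sketch of how to compare such results safely uses `math.isclose`, which checks equality within a small relative tolerance:

```python
import math

sin_30 = math.sin(math.radians(30))

print(sin_30 == 0.5)              # exact comparison of two floats
print(math.isclose(sin_30, 0.5))  # comparison within a small relative tolerance
```

Prefer `math.isclose` over `==` whenever the values being compared come out of floating-point calculations.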


### The os module
* The `os` module provides a way of interacting with the operating system.

* It includes functions for file and directory manipulation, environment variables, and more.

In [14]:
import os

current_directory = os.getcwd()

print(current_directory)  


/Users/farhad/Library/CloudStorage/OneDrive-TheUniversityofColoradoDenver/Teaching/2024_spring/MATH1376/Lectures
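A few more `os` functions cover the other tasks mentioned above. This is a small sketch rather than a complete tour; the folder name `data` and file name `example.csv` are hypothetical and do not need to exist.

```python
import os

current_directory = os.getcwd()

# Names of the files and folders inside the current directory
entries = os.listdir(current_directory)
print(len(entries), "entries in", current_directory)

# Build a path in a portable way; "data" and "example.csv" are hypothetical names
data_path = os.path.join(current_directory, "data", "example.csv")
print(data_path, "exists:", os.path.exists(data_path))

# Read an environment variable, with a fallback if it is not set
print(os.environ.get("HOME", "HOME is not set"))
```

Using `os.path.join` instead of gluing strings together with `/` keeps the code working on operating systems that use a different path separator.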


### The NumPy random module

* `numpy.random` is a module within the NumPy package for generating random numbers. 
* It includes various distributions (uniform, normal, etc.) and functions for random sampling, permutation, and seeding the generator so that results are reproducible. 

In [15]:
import numpy.random 

type(numpy.random)

module

In [16]:
# Generate an array of 5 random numbers from a standard normal distribution
random_numbers = numpy.random.randn(5)

print("Random Numbers:", random_numbers)


Random Numbers: [-0.45729927  0.1953599  -0.31408132 -0.09394116 -1.47712394]
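Random numbers change on every run unless the generator is seeded. The sketch below uses NumPy's `default_rng` generator interface as one way to get reproducible draws; the seed value 42 is arbitrary.

```python
import numpy as np

# Two generators created with the same seed (the value 42 is arbitrary)
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

# Each generator draws 5 samples from a standard normal distribution
sample_a = rng_a.standard_normal(5)
sample_b = rng_b.standard_normal(5)

print(sample_a)
print(np.array_equal(sample_a, sample_b))  # same seed, same numbers
```

Fixing the seed is useful whenever results need to be checked or shared, for example in homework submissions.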


* Packages

    * In Python, a package is a way of organizing related `modules` into a single directory hierarchy.
    * A package can contain modules as well as sub-packages, which are packages nested inside it.
    * It helps in organizing and structuring large codebases.
    
* Importing modules within packages in Python involves specifying the module's path relative to the package. Here are several possible syntaxes.

    1. Import a module directly from the package using its absolute path.
    ```
    from package_name import module_name

    ```

In [17]:
from numpy import random 

random.randn(5)

array([-0.29447578,  0.90756962, -0.29821876,  0.22940491,  0.30447027])

2. Import a module with an alias.
```
from package_name import module_name as alias_name

```

In [18]:
from numpy import random as rdm

rdm.randn(5)

array([ 0.14908361, -0.0727114 , -0.34022533,  1.41234252, -0.35403399])

3. Import specific names from a module within the package.
```
from package_name.module_name import name1, name2

```

In [19]:
from numpy.random import randn

randn(5)

array([ 0.5812908 , -1.6451087 , -0.15608294, -2.69660932,  0.30348029])
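The same import syntaxes work for packages you create yourself. The following sketch builds a tiny package on disk and imports from it; the package name `number_tools_demo`, the module name `parity`, and the function `is_even` are all hypothetical names made up for this illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical package layout:
#   number_tools_demo/
#       __init__.py
#       parity.py
root = Path(tempfile.mkdtemp())
package_dir = root / "number_tools_demo"
package_dir.mkdir()

# An __init__.py file (it may be empty) marks the directory as a regular package
(package_dir / "__init__.py").write_text("")

# A module inside the package with one small function
(package_dir / "parity.py").write_text(
    "def is_even(num):\n"
    "    return num % 2 == 0\n"
)

# Make the folder that contains the package visible to the interpreter
sys.path.insert(0, str(root))
importlib.invalidate_caches()

# Syntax 1: import a module from the package
from number_tools_demo import parity

# Syntax 3: import a specific name from a module within the package
from number_tools_demo.parity import is_even

print(parity.is_even(4), is_even(3))
```

Note that the directory added to `sys.path` is the one that contains the package folder, not the package folder itself.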

### scikit-learn Overview

* scikit-learn (sklearn):

    * scikit-learn is a machine learning library in Python that provides tools for data analysis and modeling.
    * It is built on NumPy, SciPy, and Matplotlib, and it is designed to be user-friendly and efficient.
    * scikit-learn offers a wide range of machine learning algorithms for classification, regression, clustering, dimensionality reduction, and more.
    * It follows a consistent API design, making it easy to use and switch between different algorithms.
    * The library includes `modules` for data preprocessing, model evaluation, and model selection.
    
* The `linear_model` Module
    * The `linear_model` module in scikit-learn focuses on linear models for regression and classification.

In [20]:
import numpy as np 

from sklearn.linear_model import LinearRegression

# Example data
X = np.array([[1], [2], [3]])
y = np.array([2, 4, 6])

# Create a linear regression model
model = LinearRegression(fit_intercept=False)

# Train the model
model.fit(X, y)

# Make predictions
predictions = model.predict([[4]])

# Get coefficients
coefficients = model.coef_

print("Coefficients:", coefficients)

print("Predictions:", predictions)

Coefficients: [2.]
Predictions: [8.]
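The printed coefficient can be checked by hand. With `fit_intercept=False`, the model is $\hat{y} = w x$, and ordinary least squares picks the $w$ that minimizes the sum of squared errors. A short worked calculation for the three data points above, assuming the standard closed-form solution for a single feature without intercept:

$$
w = \frac{\sum_i x_i y_i}{\sum_i x_i^2} = \frac{1 \cdot 2 + 2 \cdot 4 + 3 \cdot 6}{1^2 + 2^2 + 3^2} = \frac{28}{14} = 2
$$

The prediction for $x = 4$ is then $\hat{y} = 2 \cdot 4 = 8$, which matches the output of `model.predict([[4]])`.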


### HW 9

1. Create a Python `module` named `math_operations.py`. Inside the module, define three functions:
    * `add_numbers(a, b)`: Takes two numbers, a and b, and returns their sum.
    * `multiply_numbers(a, b)`: Takes two numbers, a and b, and returns their product.
    * `power_of_two(n)`: Takes a number n and returns its square.

Import the module `math_operations` into this notebook. Use the functions from the module to 
add two numbers and print the result, multiply two numbers and print the result, and
calculate the square of a number and print the result.


2. Given a 1D numpy array containing random floating-point numbers, apply min-max scaling to transform all the elements of the array so that they fall within the range of 0 to 1.

* Import the `preprocessing` module from `sklearn`.
* Initialize a `MinMaxScaler` object with the feature range set to (0, 1).
* Reshape the given 1D numpy array to 2D since `MinMaxScaler` expects a 2D array. You can use the `.reshape(-1, 1)` method for this purpose.
* `Fit` the scaler to the data and then `transform` the data using the scaler.
* Convert the scaled data back to a 1D numpy array.
* Print the original array and the scaled array to show the transformation.

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

In [None]:
import numpy as np

# Given 1D numpy array
original_array = np.array([34, 47, 22, 15, 63, 55])


3. Outliers are observations that lie an abnormal distance from other values in a random sample from a population. They can significantly skew summary statistics such as the mean, leading to inaccurate analyses and interpretations. The Interquartile Range (IQR) method is a simple yet effective way to detect outliers because it relies on quartiles, which are resistant to extreme values.

* Import the necessary `numpy` library for data handling.
* Generate a 1D numpy array with 100 random elements to represent your dataset.

* Quartile Calculation:

    * Calculate the first quartile (Q1) and third quartile (Q3) of your dataset. These represent the 25th and 75th percentiles, respectively.
    * Interquartile Range Computation: Determine the Interquartile Range (IQR) by subtracting Q1 from Q3. The IQR is the width of the interval that contains the middle 50% of your data and is used to gauge the dataset's spread.
    
* Outlier Identification:

    * Define outliers as those points that fall below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. Identify and print these outlier values from your dataset.

In [None]:
import numpy as np

# do not change the seed
np.random.seed(31)

data = np.random.randn(100)

data