## Question 1

Select two datasets and create a user-defined function in Python that relies on the core features of the language, without using pandas built-in functions such as:

numeric_df = df.select_dtypes(include=['number'])

categorical_df = df.select_dtypes(exclude=['number'])

Your function should accept a DataFrame as input, categorize its columns into numeric and categorical types, and display the lists of numeric and categorical columns. Include a section in your report where you discuss your interpretation of this task and its significance within the field of artificial intelligence. Make sure to run the code and attach a screenshot of your machine in the appendix of your report.


In [91]:
def categorize_columns(dataset):
    num_cols = []  # List to store names of numeric columns
    cat_cols = []  # List to store names of categorical columns

    print(f"Analyzing columns for dataset: {list(dataset.keys())}")  # Print the column names of the input dataset

    for column in dataset.keys():  # Iterate over each column in the dataset
        num_values = []  # List to store numeric values in the column
        cat_values = []  # List to store categorical values in the column
        print(f"\nChecking column '{column}':")  # Print the name of the column being checked

        for value in dataset[column]:  # Iterate over each value in the column
            if isinstance(value, (int, float)) and not isinstance(value, bool):  # Check if the value is an int or float (bool is excluded)
                num_values.append(value)  # Append numeric value to num_values list
                print(f"Value '{value}' is numeric.")  # Print that the value is numeric
            else:
                cat_values.append(value)  # Append non-numeric value to cat_values list
                print(f"Value '{value}' is not numeric.")  # Print that the value is not numeric

        # Check if the column is purely numeric, purely categorical, or a mixture
        if num_values and not cat_values:  # If num_values is not empty and cat_values is empty
            num_cols.append(column)  # Column is numeric, append to num_cols
            print(f"Column '{column}' is numeric.")  # Print that the column is numeric
        elif cat_values and not num_values:  # If cat_values is not empty and num_values is empty
            cat_cols.append(column)  # Column is categorical, append to cat_cols
            print(f"Column '{column}' is categorical.")  # Print that the column is categorical
        else:  # Both lists are non-empty (this branch is also reached when the column has no values)
            print(f"Column '{column}' contains a mixture of numeric and categorical values.")  # Print that the column contains a mixture
            num_cols.append(column)  # Append column to num_cols
            cat_cols.append(column)  # Append column to cat_cols

    print("\nNumeric columns:", num_cols)  # Print the list of numeric columns
    print("Categorical columns:", cat_cols)  # Print the list of categorical columns

    return num_cols, cat_cols  # Return lists of numeric and categorical column names


In [93]:
# Creating two sample datasets, each with one column
dataset1 = {'A': [1, 2, 3, 4, 5]}  # Dataset with a single numeric column
dataset2 = {'B': ['apple', 'banana', 'cherry', 'date', 'elderberry']}  # Dataset with a single categorical column

# Call the function for each dataset
num_cols1, cat_cols1 = categorize_columns(dataset1)  # Categorize columns in dataset1
num_cols2, cat_cols2 = categorize_columns(dataset2)  # Categorize columns in dataset2





Analyzing columns for dataset: ['A']

Checking column 'A':
Value '1' is numeric.
Value '2' is numeric.
Value '3' is numeric.
Value '4' is numeric.
Value '5' is numeric.
Column 'A' is numeric.

Numeric columns: ['A']
Categorical columns: []
Analyzing columns for dataset: ['B']

Checking column 'B':
Value 'apple' is not numeric.
Value 'banana' is not numeric.
Value 'cherry' is not numeric.
Value 'date' is not numeric.
Value 'elderberry' is not numeric.
Column 'B' is categorical.

Numeric columns: []
Categorical columns: ['B']
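
Printing every value becomes noisy for larger inputs. The following is a minimal sketch of a quieter variant, assuming columns arrive as a mapping from column name to a list of values (as in `dataset1` and `dataset2`). The name `split_columns`, the use of `numbers.Number`, and the separate `mixed` list are illustrative choices and not part of the solution above.

```python
from numbers import Number


def split_columns(dataset):
    """Split column names into numeric, categorical, and mixed lists."""
    numeric, categorical, mixed = [], [], []
    for name in dataset.keys():
        values = list(dataset[name])
        # bool is a subclass of int, so it is excluded explicitly
        flags = [isinstance(v, Number) and not isinstance(v, bool) for v in values]
        if values and all(flags):
            numeric.append(name)
        elif not any(flags):
            categorical.append(name)  # also covers empty columns
        else:
            mixed.append(name)
    return numeric, categorical, mixed


# Hypothetical sample with one column of each kind
sample = {
    "A": [1, 2, 3],
    "B": ["apple", "banana", "cherry"],
    "C": [1.5, "n/a", 3.0],
}
numeric, categorical, mixed = split_columns(sample)
print("Numeric columns:", numeric)
print("Categorical columns:", categorical)
print("Mixed columns:", mixed)
```

Reporting mixed columns in their own list, rather than in both the numeric and categorical lists, is one possible design; it keeps each column name in exactly one category.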


## Q2 (15 marks)



Write a function that achieves the following objectives:



In pandas:
- Concatenate the two datasets that you used in Question 1 along the rows.
- Remove any duplicate rows.
- Print the number of rows and columns in the resulting DataFrame.
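
The three pandas steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the frames `left` and `right` are hypothetical stand-ins that share the same columns, so that stacking along the rows lines up; they are not the Question 1 datasets.

```python
import pandas as pd

# Hypothetical frames with identical columns; the row (2, "banana") appears in both
left = pd.DataFrame({"id": [1, 2], "fruit": ["apple", "banana"]})
right = pd.DataFrame({"id": [2, 3], "fruit": ["banana", "cherry"]})

# axis=0 stacks rows; ignore_index=True renumbers the result from 0
combined = pd.concat([left, right], axis=0, ignore_index=True)

# The default keep="first" retains one copy of each duplicated row
deduped = combined.drop_duplicates().reset_index(drop=True)

n_rows, n_cols = deduped.shape
print(f"Number of rows: {n_rows}")
print(f"Number of columns: {n_cols}")
```

When the frames do not share column names, stacking along the rows fills the non-matching cells with `NaN`, so the choice of axis depends on how the two datasets relate.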




In NumPy:
- Calculate the correlation matrix for all numeric columns.
- Identify the pair of columns with the highest correlation coefficient.
- Print the names of these columns along with their correlation coefficient.
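
The NumPy steps above can be sketched without explicit loops. The example below is a hedged illustration: the column names `x`, `y`, `z` and their values are made up for the demonstration, and "highest" is read as the largest signed coefficient.

```python
import numpy as np

names = ["x", "y", "z"]  # hypothetical column names
x = np.arange(10, dtype=float)
y = x ** 2                                   # increases with x
z = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3], dtype=float)
data = np.column_stack([x, y, z])

# rowvar=False treats each column as a variable
corr = np.corrcoef(data, rowvar=False)

# Look only at the upper triangle; k=1 skips the diagonal of 1.0 values
rows, cols = np.triu_indices(len(names), k=1)
best = np.argmax(corr[rows, cols])
i, j = rows[best], cols[best]

print(f"Columns with highest correlation: ({names[i]}, {names[j]})")
print(f"Correlation coefficient: {corr[i, j]:.4f}")
```

To rank pairs by strength regardless of sign, `np.argmax` could be applied to `np.abs(corr[rows, cols])` instead.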

Include a section in your report where you discuss your comprehension of this task and its relevance in the field of data pre-processing and analysis using pandas and NumPy libraries. Make sure to run the code and attach a screenshot of your machine in the appendix of your report.


In [113]:
import pandas as pd
import numpy as np

# Sample datasets based on Question 1 (dataset1 now holds float values and a second column 'E')
dataset1 = {'A': [1.2, 2.5, 3.7, 4.1, 5.3], 'E': [10.5, 20.2, 30.8, 40.1, 50.6]}
dataset2 = {'B': ['apple', 'banana', 'cherry', 'date', 'elderberry']}

# Additional datasets with duplicate entries
dataset3 = {'C': [10.2, 20.7, 30.1, 10.2, 20.7]}
dataset4 = {'D': ['red', 'green', 'blue', 'red', 'green']}

def concatenate_and_analyze(data1, data2, data3, data4):
    # Convert dictionaries to DataFrames
    df1 = pd.DataFrame.from_dict(data1, orient='columns')
    df2 = pd.DataFrame.from_dict(data2, orient='columns')
    df3 = pd.DataFrame.from_dict(data3, orient='columns')
    df4 = pd.DataFrame.from_dict(data4, orient='columns')

    # Pandas operations
    print("Pandas operations:")
    print("-------------------")

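    # Concatenate the four DataFrames side by side (axis=1); their column names differ,
    # so stacking them along the rows (axis=0) would fill the result with NaN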
    df_concat = pd.concat([df1, df2, df3, df4], axis=1)
    print("Concatenated DataFrame:")
    print(df_concat)

    # Remove duplicate rows, keeping the first occurrence of each
    df_concat = df_concat.drop_duplicates()
    print("\nDataFrame after removing duplicates:")
    print(df_concat)

    # Print the number of rows and columns
    num_rows, num_cols = df_concat.shape
    print(f"\nNumber of rows: {num_rows}")
    print(f"Number of columns: {num_cols}")

    # NumPy operations
    print("\nNumPy operations:")
    print("------------------")

    # Convert DataFrame to NumPy array for numeric columns
    numeric_cols = df_concat.select_dtypes(include=[np.number]).columns
    numeric_data = df_concat[numeric_cols].to_numpy()

    # Calculate correlation matrix
    if len(numeric_cols) > 0:
        corr_matrix = np.corrcoef(numeric_data.T)
        print("Correlation matrix:")
        print(corr_matrix)

        # Find pair of columns with highest correlation
        num_cols = len(numeric_cols)
        max_corr = -np.inf  # Start below any possible coefficient so negative values are handled
        max_cols = None

        for i in range(num_cols):
            for j in range(i + 1, num_cols):
                corr_coef = corr_matrix[i, j]
                if corr_coef > max_corr:
                    max_corr = corr_coef
                    max_cols = (numeric_cols[i], numeric_cols[j])

        # Print the pair of columns with highest correlation
        print(f"\nColumns with highest correlation: {max_cols}")
        print(f"Correlation coefficient: {max_corr}")
    else:
        print("No numeric columns found in the DataFrame.")

# Call the function with the sample datasets
concatenate_and_analyze(dataset1, dataset2, dataset3, dataset4)



Pandas operations:
-------------------
Concatenated DataFrame:
     A     E           B     C      D
0  1.2  10.5       apple  10.2    red
1  2.5  20.2      banana  20.7  green
2  3.7  30.8      cherry  30.1   blue
3  4.1  40.1        date  10.2    red
4  5.3  50.6  elderberry  20.7  green

DataFrame after removing duplicates:
     A     E           B     C      D
0  1.2  10.5       apple  10.2    red
1  2.5  20.2      banana  20.7  green
2  3.7  30.8      cherry  30.1   blue
3  4.1  40.1        date  10.2    red
4  5.3  50.6  elderberry  20.7  green

Number of rows: 5
Number of columns: 5

NumPy operations:
------------------
Correlation matrix:
[[1.         0.98978573 0.34388804]
 [0.98978573 1.         0.20941052]
 [0.34388804 0.20941052 1.        ]]

Columns with highest correlation: ('A', 'E')
Correlation coefficient: 0.9897857278385591


## Q3 (10 marks)


Write a Python program to implement the algorithm below:

- Create a NumPy array with 1000 random elements (numbers) and take the mean of every 5-sample window.
- [datamean] <-- mean([numpy_array(1 : 5 : end)])
- [data]min <-- min([datamean])
- [data]max <-- max([datamean])
- [value]max <-- max(abs([data]max), abs([data]min))

Include a section in your report where you discuss your interpretation of this task and its significance. Make sure to run the code and attach a screenshot of your machine in the appendix of your report.
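
One way to read the algorithm is sketched below. It assumes that `1 : 5 : end` means non-overlapping windows of five consecutive samples, and it uses a reshape instead of a Python loop. The seed value is an arbitrary choice made only so the run is repeatable.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed, for repeatability only
samples = rng.random(1000)            # uniform values in [0, 1)

window = 5
# 1000 is divisible by 5, so the array reshapes cleanly into 200 rows of 5
data_mean = samples.reshape(-1, window).mean(axis=1)

data_min = data_mean.min()
data_max = data_mean.max()
value_max = max(abs(data_max), abs(data_min))

print(f"[data]min: {data_min}")
print(f"[data]max: {data_max}")
print(f"[value]max: {value_max}")
```

Because the samples are non-negative, `[value]max` equals `[data]max` here; the absolute values only matter when the data can be negative.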


In [120]:
import numpy as np

# Create a NumPy array with 1000 random elements
np_array = np.random.rand(1000)

# Take the mean of every 5 sample window
data_mean = [np_array[i:i+5].mean() for i in range(0, len(np_array), 5)]
data_mean = np.array(data_mean)

# Calculate [data]min
data_min = data_mean.min()

# Calculate [data]max
data_max = data_mean.max()

# Calculate [value]max
value_max = max(abs(data_max), abs(data_min))

print("Original NumPy array:")
print(np_array)
print("\nMean of every 5 sample window:")
print(data_mean)
print(f"\n[data]min: {data_min}")
print(f"[data]max: {data_max}")
print(f"[value]max: {value_max}")


Original NumPy array:
[0.77045826 0.70558955 0.61630134 0.77398436 0.89598279 0.01102146
 0.07847726 0.23110438 0.63032356 0.10333476 0.40789831 0.78808898
 0.11337574 0.66487454 0.36112124 0.15406345 0.25156219 0.40770339
 0.62672329 0.25572568 0.95856298 0.58782512 0.92134421 0.10696187
 0.95143226 0.06190021 0.36467256 0.19054121 0.66940232 0.5564287
 0.95939094 0.62664306 0.17657121 0.50277786 0.4836765  0.64844518
 0.1407331  0.27215938 0.17949838 0.75780676 0.84985037 0.56553512
 0.71776431 0.90722213 0.22817064 0.26355384 0.22045796 0.91631078
 0.5039407  0.71430883 0.89870253 0.06493211 0.05220996 0.16447701
 0.36739494 0.31143555 0.52097007 0.99988226 0.61134347 0.80622903
 0.45909772 0.40592023 0.38580332 0.46589672 0.65780106 0.31588738
 0.50571847 0.42233036 0.11597369 0.69188775 0.87111703 0.04611042
 0.01339897 0.11076686 0.44765395 0.93383718 0.40237405 0.15944177
 0.05093065 0.46902024 0.23715897 0.61312045 0.07298765 0.92380514
 0.58536792 0.56462671 0.13294545 0.68112