In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Dataset Representation

## Processing of Dataset into Dataframe
Load the dataset into a DataFrame. Perform necessary operations to properly load your dataset into a DataFrame.

In [14]:
col_names = ["atmpt1", "atmpt2", "atmpt3", "atmpt4", "atmpt5", "atmpt6", "atmpt7", "atmpt8"]
# Rows hold between 1 and 8 values; shorter rows are padded with NaN on load.
df = pd.read_csv('Dataset1.csv', names=col_names, dtype=np.float32)

df

Unnamed: 0,atmpt1,atmpt2,atmpt3,atmpt4,atmpt5,atmpt6,atmpt7,atmpt8
0,5.0,14.0,,,,,,
1,0.0,2.0,10.0,12.0,,,,
2,3.0,5.0,13.0,14.0,16.0,17.0,19.0,
3,1.0,3.0,4.0,7.0,9.0,10.0,14.0,17.0
4,15.0,16.0,17.0,,,,,
...,...,...,...,...,...,...,...,...
295,8.0,13.0,14.0,,,,,
296,4.0,6.0,9.0,11.0,,,,
297,2.0,4.0,8.0,11.0,12.0,,,
298,5.0,6.0,7.0,9.0,13.0,14.0,16.0,


## Brief Description of the Dataset

The dataset we chose, "Dataset1", contains 300 observations, each a series of numbers sorted in ascending order. The number of values per row varies across observations. Judging from the structure of the dataset after loading it into a DataFrame, we assume that it is an association rule mining dataset.

## Structure of the Dataset

In [3]:
df.columns

Index(['atmpt1', 'atmpt2', 'atmpt3', 'atmpt4', 'atmpt5', 'atmpt6', 'atmpt7',
       'atmpt8'],
      dtype='object')

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   atmpt1  300 non-null    float64
 1   atmpt2  267 non-null    float64
 2   atmpt3  222 non-null    float64
 3   atmpt4  189 non-null    float64
 4   atmpt5  151 non-null    float64
 5   atmpt6  109 non-null    float64
 6   atmpt7  80 non-null     float64
 7   atmpt8  40 non-null     float64
dtypes: float64(8)
memory usage: 18.9 KB
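
The non-null counts above fall from 300 in `atmpt1` to 40 in `atmpt8`, which reflects the varying number of values per observation. As a minimal sketch of how the size of each observation could be measured, the example below uses a small toy DataFrame with made-up values (`toy` is a stand-in, not `Dataset1`):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df: three ragged rows padded with NaN (illustrative values only).
toy = pd.DataFrame(
    [[5.0, 14.0, np.nan, np.nan],
     [0.0, 2.0, 10.0, 12.0],
     [15.0, 16.0, 17.0, np.nan]],
    columns=["atmpt1", "atmpt2", "atmpt3", "atmpt4"],
)

# Number of non-missing values per row, i.e. the size of each observation.
sizes = toy.notna().sum(axis=1)
print(sizes.tolist())          # [2, 4, 3]

# How many observations have each size.
size_counts = sizes.value_counts().sort_index()
print(size_counts.to_dict())   # {2: 1, 3: 1, 4: 1}
```

Applying the same two calls to `df` should give the distribution of observation sizes for the real dataset.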


## Discuss the observations in the dataset file. 
What does each observation represent? Since the dataset description is not provided, the group can presume the entity represented by each observation. For example:
- For an association rule mining dataset, the group may presume that an observation represents a customer transaction from a store, a list of words in a document, and other similar examples.
- For a clustering dataset, the group may presume that an observation represents a song from a group of songs, an image from a group of images, and other similar examples.
- For a collaborative filtering dataset, the group may presume that an observation represents a movie being rated by people, a book being rated by readers, and other similar examples.

The dataset used for this project is "Dataset1", which contains 300 observations, each a series of numbers. Based on our group discussion, we treat each observation as a student's set of scores for a retakeable online quiz. As the dataset is an **association rule mining** dataset, we chose this representation because the data fits it well and it is relevant to our current situation during the pandemic.

## Discuss the variables in the dataset file. 
Since the description of the dataset is not provided, the group can presume the entity represented by each variable. For example:
- For an association rule mining dataset, the group may presume that a variable represents the presence of a certain item in a customer transaction from a store, the presence of a word in a document, and other similar examples.
- For a clustering dataset, the group may presume that a variable represents a certain characteristic or feature of a song (i.e., value representing the tempo, rhythm, pitch, and others), a certain characteristic or feature of an image (i.e., amount of black, amount of white, and others), and other similar examples.
- For a collaborative filtering dataset, the group may presume that a variable represents a rating of a specific person to a movie,

There are 8 variables in our chosen dataset, and each represents one attempt made by the student at the online quiz. This is why we named every column (as seen below) after the corresponding attempt.

In [5]:
df.columns

Index(['atmpt1', 'atmpt2', 'atmpt3', 'atmpt4', 'atmpt5', 'atmpt6', 'atmpt7',
       'atmpt8'],
      dtype='object')

# Exploratory Data Analysis
Perform exploratory data analysis comprehensively to gain a good understanding of your
dataset. The exploratory data analysis should guide you in formulating the research
questions of the project.

In this section of the notebook, you must fulfill the following:
- Identify 2 interesting exploratory data analysis questions. Properly state the questions in the notebook.
- Answer the EDA questions using both:
    - Numerical Summaries: measures of central tendency, measures of dispersion, and correlation
    - Visualization: Appropriate visualization should be used. Each visualization should be accompanied by a brief explanation.
- To emphasize, both numerical summary and visualization should be present to answer each question. The whole process should be supported with verbose textual descriptions of your procedures and findings.
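
As a hedged illustration of a numerical summary for this kind of data, the sketch below counts how often each value occurs and then summarises those counts. The DataFrame `toy` and its values are made up for the example and are not taken from `Dataset1`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df (illustrative values only).
toy = pd.DataFrame(
    [[5.0, 14.0, np.nan],
     [0.0, 5.0, 14.0],
     [5.0, np.nan, np.nan]],
    columns=["atmpt1", "atmpt2", "atmpt3"],
)

# Flatten every cell into one long Series and drop the NaN padding.
values = pd.Series(toy.to_numpy().ravel()).dropna()

# Frequency of each distinct value across all observations.
item_counts = values.value_counts().sort_index()
print(item_counts.to_dict())   # {0.0: 1, 5.0: 3, 14.0: 2}

# Central tendency and dispersion of the frequencies.
print(item_counts.mean())      # 2.0
print(item_counts.std())       # 1.0
```

A bar chart of the same counts, for example `item_counts.plot(kind="bar")`, would be one way to pair the numerical summary with a visualization.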

# Data Mining

Identify the correct data mining technique to apply to your chosen dataset. The technique that you will apply should be appropriate for the dataset. Apply the data mining technique with the provided hyperparameters and answer the provided questions.

For Association Rule Mining:
- Use the rule_miner.py file from our exercises. Make sure that your code is working properly. Set support_t to 10 and the confidence_t to 0.6.
- Perform association rule mining.
- Answer the question: Using the provided support threshold and confidence threshold, what is/are the association rules that we derived from the dataset?
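
To make the two thresholds concrete, here is a minimal sketch on toy baskets. The helper names `support` and `confidence`, the toy data, and the use of `>=` for both comparisons are assumptions made for illustration; `rule_miner.py` may define these differently.

```python
# Toy baskets (illustrative only, not taken from Dataset1).
baskets = [{0, 1, 2}, {0, 1}, {0, 2}, {1, 2}, {0, 1, 2}]

def support(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    return sum(1 for b in baskets if set(itemset) <= b)

def confidence(antecedent, consequent, baskets):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

# Toy thresholds; the project itself uses support_t = 10 and confidence_t = 0.6.
support_t, confidence_t = 3, 0.6

s = support({0, 1}, baskets)          # baskets containing both 0 and 1
c = confidence({0}, {1}, baskets)     # confidence of the rule {0} -> {1}
print(s, c)                           # 3 0.75

# The rule is kept only if it passes both thresholds (assumed to be inclusive).
keep = s >= support_t and c >= confidence_t
print(keep)                           # True
```

Note that `confidence` divides by the antecedent's support, so it is only defined for antecedents that occur in at least one basket.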

For Clustering:
- State the number of observations per group before clustering.
- Use the kmeans.py file from our exercises. Make sure that your code is working properly. Set the k, start_var, end_var, num_observations, and data to their appropriate values according to the dataset.
- Perform clustering with maximum iterations set to 300.
- Answer the question: After clustering, how many observations of each class are included in per cluster?

For Collaborative Filtering:
- Use the collaborative_filtering.py file from our exercises. Make sure that your code is working properly. Set k to 5.
- Perform collaborative filtering.
- Answer the question: Give the top 5 items that are most similar to the item at index 0.

Import the `RuleMiner` class

In [6]:
from rule_miner import RuleMiner

Set `support_t` equal to `10` and `confidence_t` equal to `0.6`. The field `support_t` represents the support threshold, while the field `confidence_t` represents the confidence threshold.

In [7]:
rule_miner = RuleMiner(10, 0.6)

As of now, each row of `df` holds one observation (basket) as a NaN-padded list of item values. Instead of using this representation, we will convert our dataset to a binary matrix represented as a `pandas` `DataFrame`. The `DataFrame` will contain 300 rows, equivalent to the number of observations in the dataset, and one column per distinct item in the dataset. The value in the cell in row `x` and column `y` is 1 if item `y` is in observation (basket) `x`; otherwise it is 0.

In [11]:
# One column per distinct item; NaN padding is skipped.
items = sorted({int(v) for v in df.to_numpy().ravel() if not np.isnan(v)})
syn_df = pd.DataFrame(0, index=range(len(df)), columns=items)

for i, row in enumerate(df.to_numpy()):  # `row`, not `df`, so `df` is not overwritten
    syn_df.loc[i, [int(v) for v in row if not np.isnan(v)]] = 1


Open `rule_miner.py` file and complete the `get_support()` function. This function returns the support for an itemset. The support of an itemset refers to the number of baskets wherein the itemset is present.
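
As a rough sketch of the idea, and not the actual contents of `rule_miner.py`, support on a 0/1 matrix can be counted by checking which rows have a 1 in every column of the itemset. The function name `get_support_sketch` and the toy matrix are hypothetical.

```python
import pandas as pd

def get_support_sketch(data, itemset):
    """Count rows of a 0/1 DataFrame in which every column in `itemset` is 1."""
    return int(data[itemset].all(axis=1).sum())

# Toy basket matrix: rows are baskets, columns are items 0, 1, 2.
toy = pd.DataFrame(
    [[1, 1, 0],
     [1, 0, 1],
     [1, 1, 1]],
    columns=[0, 1, 2],
)

print(get_support_sketch(toy, [0]))        # 3
print(get_support_sketch(toy, [0, 1]))     # 2
print(get_support_sketch(toy, [0, 1, 2]))  # 1
```

The itemset entries must match the column labels of the matrix exactly, which is why integer labels are used for both here.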

Implement the `get_support()` function. Inline comments should help you in completing the contents of the function. Upon implementing the function, execute the code below then answer the questions.

In [10]:
print(rule_miner.get_support(syn_df, [0]))
print(rule_miner.get_support(syn_df, [0, 1]))
print(rule_miner.get_support(syn_df, [0, 1, 2]))

In [None]:
frequent_itemsets = rule_miner.get_frequent_itemsets(df)
print(frequent_itemsets)

The frequent itemsets in the dataset, given the support threshold of 10, are: `[['atmpt1'], ['atmpt2']]`

Using the `get_rules()` function in `rule_miner.py`, let us list all the possible rules for all frequent itemsets in our dataset. The `get_rules()` function returns a list of rules produced from an itemset.

In [None]:
for itemset in frequent_itemsets:
    print(rule_miner.get_rules(itemset))
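
For intuition about what rule generation from an itemset involves, the sketch below enumerates every way of splitting an itemset into a non-empty antecedent and a non-empty consequent. The function `get_rules_sketch` and its return format are assumptions for illustration; the `get_rules()` function in `rule_miner.py` may represent rules differently.

```python
from itertools import combinations

def get_rules_sketch(itemset):
    """Return every (antecedent, consequent) split of `itemset` with both sides non-empty."""
    rules = []
    for r in range(1, len(itemset)):
        for antecedent in combinations(itemset, r):
            consequent = [i for i in itemset if i not in antecedent]
            rules.append((list(antecedent), consequent))
    return rules

rules = get_rules_sketch([0, 1, 2])
print(len(rules))   # 6
print(rules[0])     # ([0], [1, 2])

# An itemset with a single item cannot be split, so it yields no rules.
print(get_rules_sketch([0]))   # []
```

Each candidate rule would then be kept only if its confidence meets the confidence threshold.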

# Insights and Conclusions
Clearly state your answers from the data to answer each provided question. Make sure that all conclusions are backed up with proper data mining procedures.