<a href="https://colab.research.google.com/github/qiaojunch/CS3700-Networks-and-Distributed-Systems/blob/master/Assignment_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS6220 Data Mining: Assignment 1


In [4]:
import pandas as pd

## Problem1

The cardinality of a set or collection of items is the number of unique items in that collection. Write a function called `cardinality_items` that takes a .csv text string file as input, where the format is as the below, and calculates the cardinality of the set of all the grocery items in any given dataset.

What is the cardinality in “basket data.csv”?

Data format:

ketchup, butter, diapers\
bread, diapers, ketchup\
butter, bread, ketchup


### P1 Solution:

In [12]:
def cardinality_items( filename ):
  '''
  Takes a filename "*.csv" and returns an integer
  '''
  # Read the CSV file into a Pandas DataFrame
  df = pd.read_csv(filename, sep='delimiter', header=None)

  # Stack the DataFrame to convert it to a single column
  stacked = df.stack()

  # Use set to get unique items and calculate cardinality
  cardinality = len(set(stacked))

  return cardinality

In [15]:
filename = 'https://raw.githubusercontent.com/qiaojunch/cs6220-data-mining/main/data/basket_data.csv'
result = cardinality_items(filename)
print(f"Cardinality of grocery items: {result}\n")

Cardinality of grocery items: 78



  df = pd.read_csv(filename, sep='delimiter', header=None)


## Problem2

Write a function called all_itemsets that takes a list of unique items and an integer k as input, and the output is a list of all possible unique itemsets with non-repeating k items. That is, the output is L = [S1, S2, · · · SN ], a list of all possible sets, where each Si has k items.

For example,

`all_itemsets( ["ham", "cheese", "bread"], 2 )`

should result in:

`[ ["ham", "cheese"], ["ham", "bread"], [’cheese", "bread"] ]`

You should not need any library functions.

### P2 Solution:

In [16]:
def all_itemsets(items, k):
    # Check if k is greater than the length of items
    if k > len(items):
        return []

    # Initialize the list to store all itemsets
    result = []

    # Recursive function to generate itemsets
    def generate_itemsets(current_itemset, remaining_items, remaining_k):
        if remaining_k == 0:
            result.append(current_itemset.copy())
            return

        for i, item in enumerate(remaining_items):
            current_itemset.append(item)
            generate_itemsets(current_itemset, remaining_items[i + 1:], remaining_k - 1)
            current_itemset.pop()

    generate_itemsets([], items, k)

    return result

In [17]:
# Example usage:
items = ["ham", "cheese", "bread"]
k = 2
result = all_itemsets(items, k)
print(result)

[['ham', 'cheese'], ['ham', 'bread'], ['cheese', 'bread']]


## Netflix Dataset


In [1]:
#@title Download the data from website
!wget -nc https://course.ccs.neu.edu/cs6220/fall2023/homework-1/netflix-data/movie_titles.csv
!wget -nc https://course.ccs.neu.edu/cs6220/fall2023/homework-1/netflix-data/combined_data_1.txt
!wget -nc https://course.ccs.neu.edu/cs6220/fall2023/homework-1/netflix-data/combined_data_2.txt
!wget -nc https://course.ccs.neu.edu/cs6220/fall2023/homework-1/netflix-data/combined_data_3.txt
!wget -nc https://course.ccs.neu.edu/cs6220/fall2023/homework-1/netflix-data/combined_data_4.txt

from IPython.display import clear_output
clear_output()

print("Data in combined_data_1.txt looks like this: \n")
!head -5 combined_data_1.txt

print("\n\nData in movie_titles.csv looks like this: \n")
!head -5 movie_titles.csv

Data in combined_data_1.txt looks like this: 

1:
1488844,3,2005-09-06
822109,5,2005-05-13
885013,4,2005-10-19
30878,4,2005-12-26


Data in movie_titles.csv looks like this: 

1,2003,Dinosaur Planet
2,2004,Isle of Man TT 2004 Review
3,1997,Character
4,1994,Paula Abdul's Get Up & Dance
5,2004,The Rise and Fall of ECW


In [18]:
f = open("movie_titles.csv", encoding ="cp1252")
data_lines = open("combined_data_1.txt", "r").readlines()

## Problem 3

Let’s review combined_data_*.txt.

a) How many total records of movie ratings are there in the entire dataset (over all of combined_data_*.txt)?

b) How many total unique users are there in the entire dataset (over all of combined_data_*.txt)?

c) What is the range of years that this data is valid over?

In [26]:
import pandas as pd
from glob import glob

def analyze_movie_ratings(files_pattern):
    # Step 1: Get a list of all file paths matching the pattern
    file_paths = glob(files_pattern)

    # Initialize variables for counting records and unique users
    total_records = 0
    unique_users = set()
    years = []

    # Iterate through each file
    for file_path in file_paths:
        # Step 2: Read the file into a list of lines
        with open(file_path, 'r') as file:
            lines = file.readlines()

        # Initialize variables for movie_id and data
        movie_id = None
        data = []

        # Step 3: Iterate through each line
        for line in lines:
            if line.endswith(":\n"):
                # Save the previous movie's data
                if movie_id is not None:
                    df = pd.DataFrame(data, columns=['CustomerID', 'Rating', 'Date'])
                    total_records += len(df)
                    unique_users.update(df['CustomerID'].unique())
                    years.extend(pd.to_datetime(df['Date']).dt.year.tolist())

                # Start a new movie
                movie_id = int(line.rstrip(":\n"))
                data = []
            else:
                # Collect rating data
                parts = line.strip().split(',')
                data.append(parts)

        # Process the last movie in the file
        if movie_id is not None:
            df = pd.DataFrame(data, columns=['CustomerID', 'Rating', 'Date'])
            total_records += len(df)
            unique_users.update(df['CustomerID'].unique())
            years.extend(pd.to_datetime(df['Date']).dt.year.tolist())

    # Step 4: Get the range of years
    min_year = min(years)
    max_year = max(years)

    # Return results
    return total_records, len(unique_users), (min_year, max_year)

In [27]:
# Example usage:
files_pattern = 'combined_data_*.txt'
total_records, unique_users, years_range = analyze_movie_ratings(files_pattern)

print(f'a) Total records of movie ratings: {total_records}')
print(f'b) Total unique users: {unique_users}')
print(f'c) Range of years: {years_range}')

a) Total records of movie ratings: 100480507
b) Total unique users: 480189
c) Range of years: (1999, 2005)


## Problem 4

Let’s review movie_titles.csv.

a) How many movies with unique names are there? That is to say, count the distinct names of the movies.

b) How many movie names refer to four different movies?

## Problem 5

Let’s review both.

a) How many users rated exactly 200 movies?

b) Of these users, take the lowest user ID and print out the names of the movies that this person liked the most (all 5 star ratings).