## Exploring the 'tags' table
### 1 Library and DuckDB file import

In [1]:

#expanding initial exploration of the "tags" table

import duckdb, pandas as pd
from pathlib import Path

#create or connect if it already exists
con = duckdb.connect("movielens100K.duckdb")

### 2 General dataset description

In [2]:
#view the table structure (columns and data types)
con.sql("DESCRIBE tags").df()

Unnamed: 0,column_name,column_type,null,key,default,extra
0,userId,INTEGER,YES,,,
1,movieId,INTEGER,YES,,,
2,tag,VARCHAR,YES,,,
3,timestamp,TIMESTAMP WITH TIME ZONE,YES,,,


Comments:

 - "userId": INTEGER (32-bit signed integer);
 - "movieId": INTEGER (32-bit signed integer);
 - "tag": VARCHAR;
 - "timestamp": TIMESTAMP WITH TIME ZONE;


In [3]:
#check the data types of each column
con.sql("PRAGMA table_info('tags')").df()

Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,userId,INTEGER,False,,False
1,1,movieId,INTEGER,False,,False
2,2,tag,VARCHAR,False,,False
3,3,timestamp,TIMESTAMP WITH TIME ZONE,False,,False


Comments:

- The column `notnull` (boolean) is `False` for all four columns.  
  - Therefore, every column in this table accepts NULL values.  
  - Since the table was loaded using `read_csv_auto`, no NOT NULL constraints were defined for DuckDB to enforce.

- The column `dflt_value` shows the default value assigned when no other value is specified during an INSERT operation.  
  - In datasets imported with `read_csv_auto` (such as MovieLens), this column almost always appears as NULL.

- The column `pk` (boolean) indicates whether the column is part of the table’s primary key.  
  - It is `False` for all four columns, so the table has no primary key.


### 3 Dataset individual basic exploration
#### 3.1 Dataset composition

In [4]:
#check first 10 rows
con.sql("SELECT * FROM tags LIMIT 10").df()

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,2015-10-24 20:29:54+01:00
1,2,60756,Highly quotable,2015-10-24 20:29:56+01:00
2,2,60756,will ferrell,2015-10-24 20:29:52+01:00
3,2,89774,Boxing story,2015-10-24 20:33:27+01:00
4,2,89774,MMA,2015-10-24 20:33:20+01:00
5,2,89774,Tom Hardy,2015-10-24 20:33:25+01:00
6,2,106782,drugs,2015-10-24 20:30:54+01:00
7,2,106782,Leonardo DiCaprio,2015-10-24 20:30:51+01:00
8,2,106782,Martin Scorsese,2015-10-24 20:30:56+01:00
9,7,48516,way too long,2007-01-25 01:08:45+00:00


#### 3.2 Row Counts

In [5]:
# count total number of rows
con.sql("SELECT COUNT(*) AS total_registos FROM tags").df()

Unnamed: 0,total_registos
0,3683


#### 3.3 Number of unique movies with tags

In [6]:
#find number of unique movies with tags
con.sql("SELECT COUNT(DISTINCT movieId) AS unique_movies_with_tags FROM tags").df()


Unnamed: 0,unique_movies_with_tags
0,1572


#### 3.4 Average number of tags per movie

In [7]:
#average number of tags per movie
con.sql("""
SELECT 
    COUNT(*) * 1.0 / COUNT(DISTINCT movieId) AS avg_tags_per_movie
FROM tags
""").df()


Unnamed: 0,avg_tags_per_movie
0,2.342875


#### 3.5 Missing values

In [8]:
#count the number of missing values
con.sql("""
SELECT
    COUNT(*) - COUNT(userId) AS missing_userId,
    COUNT(*) - COUNT(movieId) AS missing_movieId,
    COUNT(*) - COUNT(tag) AS missing_tag,
    COUNT(*) - COUNT(timestamp) AS missing_timestamp
FROM tags
""").df()

Unnamed: 0,missing_userId,missing_movieId,missing_tag,missing_timestamp
0,0,0,0,0


#### Close the connection

In [9]:
con.close()
print("Connection closed.")

Connection closed.
