# Fundamentals of Data Analysis - Spambase
## Rosario Scavo (1000037803)

The data can be downloaded from the UCI Machine Learning Repository:
http://archive.ics.uci.edu/dataset/94/spambase

## Index:
* [Dataset description](#data-description)
  * [Attribute description](#attribute-description)
* [Dataset Analysis](#data-analysis)
  * [Dataset integrity](#data-integrity)
  * [Descriptive statistics](#descriptive-statistics)
  

# Dataset description <a id="data-description"></a>

The dataset includes various types of content that fall under the category of "spam", such as advertisements, chain letters, make-money-fast schemes, and pornography. The spam emails were collected from the postmaster and from individuals who had reported spam. The non-spam emails, on the other hand, were collected from personal and work files, which is why the word 'george' and the area code '650' act as indicators of non-spam.

The central goal is to establish a classification rule that identifies spam messages based on the frequency of specific words, numbers, and characters, and on the length of runs of consecutive capital letters. To achieve this, we will use several algorithms: logistic regression (LR), decision trees, and K-nearest neighbors (KNN) as classifiers, plus K-Means, which is a clustering algorithm rather than a classifier. These algorithms will be optimized through appropriate data preparation, transformation, and hyperparameter tuning using built-in Python functions. Additionally, we will determine which metrics to maximize and how that choice affects classification performance.

However, effective implementation requires thorough data analysis. Without prior data understanding, employing classifiers becomes challenging, if not impossible. This analysis will involve attribute exploration, variable type verification, missing value identification, feature-level metric analysis (mean, standard deviation, quantiles, etc.), feature importance determination for spam/non-spam classification, and outlier detection and analysis.

In [20]:
#imports
import pandas as pd
from matplotlib import pyplot as plt

In [24]:
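# One attribute name per line, in the same order as the columns of spambase.data.
names_list_filepath = 'spambase/names.txt'

with open(names_list_filepath, 'r') as file:
    attribute_names = file.read().splitlines()

# spambase.data has no header row, so the column names are supplied explicitly.
data = pd.read_csv('spambase/spambase.data', names=attribute_names)
data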

index,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total,Class
0,0.00,0.64,0.64,0.0,0.32,0.00,0.00,0.00,0.00,0.00,...,0.000,0.000,0.0,0.778,0.000,0.000,3.756,61,278,1
1,0.21,0.28,0.50,0.0,0.14,0.28,0.21,0.07,0.00,0.94,...,0.000,0.132,0.0,0.372,0.180,0.048,5.114,101,1028,1
2,0.06,0.00,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.010,0.143,0.0,0.276,0.184,0.010,9.821,485,2259,1
3,0.00,0.00,0.00,0.0,0.63,0.00,0.31,0.63,0.31,0.63,...,0.000,0.137,0.0,0.137,0.000,0.000,3.537,40,191,1
4,0.00,0.00,0.00,0.0,0.63,0.00,0.31,0.63,0.31,0.63,...,0.000,0.135,0.0,0.135,0.000,0.000,3.537,40,191,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4596,0.31,0.00,0.62,0.0,0.00,0.31,0.00,0.00,0.00,0.00,...,0.000,0.232,0.0,0.000,0.000,0.000,1.142,3,88,0
4597,0.00,0.00,0.00,0.0,0.00,0.00,0.00,0.00,0.00,0.00,...,0.000,0.000,0.0,0.353,0.000,0.000,1.555,4,14,0
4598,0.30,0.00,0.30,0.0,0.00,0.00,0.00,0.00,0.00,0.00,...,0.102,0.718,0.0,0.000,0.000,0.000,1.404,6,118,0
4599,0.96,0.00,0.00,0.0,0.32,0.00,0.00,0.00,0.00,0.00,...,0.000,0.057,0.0,0.000,0.000,0.000,1.147,5,78,0
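The loading step can be checked on a small scale. The sketch below is only an illustration: `names_txt` and `data_txt` are hypothetical in-memory stand-ins for `names.txt` and `spambase.data`, reduced to four columns with toy values. It shows why the names have to be passed explicitly and why their number should match the number of fields per record.

```python
import io

import pandas as pd

# Hypothetical stand-ins for names.txt and spambase.data (toy values, 4 columns only).
names_txt = "word_freq_make\nchar_freq_!\ncapital_run_length_total\nClass\n"
data_txt = "0.00,0.778,278,1\n0.31,0.000,88,0\n"

attribute_names = names_txt.splitlines()

# The data has no header row, so every line is parsed as a record.
toy = pd.read_csv(io.StringIO(data_txt), header=None, names=attribute_names)

# One name per field; a mismatch would leave the labels misaligned with the values.
n_fields = len(data_txt.splitlines()[0].split(","))
print(len(attribute_names) == n_fields)
print(toy.dtypes)
```

Running the same comparison against the real files (number of names versus number of comma-separated fields in the first record) is a cheap guard before any analysis.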


## Attribute description <a id="attribute-description"></a>
- The last column of 'spambase.data' (**Class**) indicates whether the email was considered spam, i.e., unsolicited commercial email (1), or not (0).
- Most attributes measure how frequently a specific word or character occurs in the email.
- Attributes 55-57 (*run-length* attributes) measure the length of sequences of consecutive capital letters.

### Definitions of Attributes:
1. **48 continuous real [0,100] attributes of type `word_freq_WORD`:**
   - Percentage of words in the email that match the specified word. 
   - Calculation: $\frac{100 \times (\text{Number of times the word appears in the email})}{\text{Total number of words in the email}}$


2. **6 continuous real [0,100] attributes of type `char_freq_CHAR`:**
   - Percentage of characters in the email that match the specified character.
   - Calculation: $\frac{100 \times (\text{Number of times the character appears in the email})}{\text{Total number of characters in the email}}$

3. **1 continuous real [1,...] attribute of type `capital_run_length_average`:**
   - Average length of uninterrupted sequences of capital letters.

4. **1 continuous integer [1,...] attribute of type `capital_run_length_longest`:**
   - Length of the longest uninterrupted sequence of capital letters.

5. **1 continuous integer [1,...] attribute of type `capital_run_length_total`:**
   - Sum of the length of uninterrupted sequences of capital letters.
   - Total number of capital letters in the email.

6. **1 nominal {0,1} class attribute of type `spam`:**
   - Denotes whether the email was considered spam, i.e., unsolicited commercial email (1), or not (0).
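To make the definitions above concrete, here is a minimal sketch that computes the three kinds of attributes for a made-up email. The tokenization (a word is a run of alphanumeric characters), the case-insensitive word matching, and the return value for text without capital letters are assumptions of this sketch, not necessarily how the dataset was built. The helper names `word_freq`, `char_freq`, and `capital_runs` are hypothetical.

```python
import re


def word_freq(text: str, word: str) -> float:
    """100 * occurrences of `word` / total number of words (case-insensitive by assumption)."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.lower() == word.lower())
    return 100 * hits / len(words)


def char_freq(text: str, char: str) -> float:
    """100 * occurrences of `char` / total number of characters."""
    if not text:
        return 0.0
    return 100 * text.count(char) / len(text)


def capital_runs(text: str) -> tuple[float, int, int]:
    """Average, longest, and total length of uninterrupted runs of capital letters."""
    runs = [len(run) for run in re.findall(r"[A-Z]+", text)]
    if not runs:
        return 0.0, 0, 0  # assumption: the dataset itself reports a minimum of 1
    return sum(runs) / len(runs), max(runs), sum(runs)


email = "FREE money!!! Call NOW, free offer"  # made-up example text

print(f"{word_freq(email, 'free'):.2f}")  # 2 of 6 words -> 33.33
print(f"{char_freq(email, '!'):.2f}")     # 3 of 34 characters -> 8.82
average, longest, total = capital_runs(email)
print(f"{average:.2f}", longest, total)   # runs FREE, C, NOW -> 2.67 4 8
```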


In [19]:
data.keys()

Index(['word_freq_make', 'word_freq_address', 'word_freq_all', 'word_freq_3d',
       'word_freq_our', 'word_freq_over', 'word_freq_remove',
       'word_freq_internet', 'word_freq_order', 'word_freq_mail',
       'word_freq_receive', 'word_freq_will', 'word_freq_people',
       'word_freq_report', 'word_freq_addresses', 'word_freq_free',
       'word_freq_business', 'word_freq_email', 'word_freq_you',
       'word_freq_credit', 'word_freq_your', 'word_freq_font', 'word_freq_000',
       'word_freq_money', 'word_freq_hp', 'word_freq_hpl', 'word_freq_george',
       'word_freq_650', 'word_freq_lab', 'word_freq_labs', 'word_freq_telnet',
       'word_freq_857', 'word_freq_data', 'word_freq_415', 'word_freq_85',
       'word_freq_technology', 'word_freq_1999', 'word_freq_parts',
       'word_freq_pm', 'word_freq_direct', 'word_freq_cs', 'word_freq_meeting',
       'word_freq_original', 'word_freq_project', 'word_freq_re',
       'word_freq_edu', 'word_freq_table', 'word_freq_conference',


- **Number of instances:** 4601, of which 1813 (39.4%) are spam
- **Number of attributes:** 58 (57 continuous, 1 categorical representing the class label).
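The class proportions follow directly from the two counts above. The short sketch below reproduces the 39.4% figure and also derives the share of the majority class, which is the accuracy a trivial classifier that always predicts non-spam would reach. The variable names are illustrative.

```python
n_instances = 4601
n_spam = 1813

n_non_spam = n_instances - n_spam
spam_share = 100 * n_spam / n_instances
majority_baseline = 100 * n_non_spam / n_instances

print(n_non_spam)                   # 2788
print(round(spam_share, 1))         # 39.4
print(round(majority_baseline, 1))  # 60.6
```

Because the classes are moderately imbalanced, any accuracy figure is best read against this roughly 60.6% baseline.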

In [42]:
class_counts = data['Class'].value_counts()
print(class_counts)
print("\n")
data.info()

Class
0    2788
1    1813
Name: count, dtype: int64


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4601 entries, 0 to 4600
Data columns (total 58 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   word_freq_make              4601 non-null   float64
 1   word_freq_address           4601 non-null   float64
 2   word_freq_all               4601 non-null   float64
 3   word_freq_3d                4601 non-null   float64
 4   word_freq_our               4601 non-null   float64
 5   word_freq_over              4601 non-null   float64
 6   word_freq_remove            4601 non-null   float64
 7   word_freq_internet          4601 non-null   float64
 8   word_freq_order             4601 non-null   float64
 9   word_freq_mail              4601 non-null   float64
 10  word_freq_receive           4601 non-null   float64
 11  word_freq_will              4601 non-null   float64
 12  word_freq_people            4601 non

# Dataset analysis <a id="data-analysis"></a>

## Dataset integrity <a id="data-integrity"></a>

Before analyzing the data, let's verify that the 'Class' attribute only contains the values 1 and 0. Additionally, we will check for any NaN values in the dataset.

In [44]:
data['Class'].unique()

array([1, 0])

In [48]:
count_nan_in_df = data.isnull().sum().sum()
print(f'Number of NaN values: {count_nan_in_df}')

Number of NaN values: 0
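The total NaN count only says whether a problem exists, not where. As a possible refinement, the sketch below runs the same two checks on a toy frame (the values and the injected NaN are invented for illustration) and reports the affected columns by name.

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing value; not taken from the dataset.
toy = pd.DataFrame({
    "word_freq_free": [0.0, 1.2, np.nan],
    "char_freq_!": [0.1, 0.0, 0.3],
    "Class": [1, 0, 1],
})

# Label check: every entry must be 0 or 1.
valid_labels = toy["Class"].isin([0, 1]).all()

# Missing-value check, broken down per column.
nan_per_column = toy.isna().sum()
columns_with_nan = nan_per_column[nan_per_column > 0]

print(valid_labels)
print(columns_with_nan)
```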


The `min` and `max` rows of the `describe` output give the minimum and maximum of each column, which lets us verify that **the values of the frequency attributes fall within the established ranges**: no minimum is below 0 and no maximum exceeds 100. The upper limit is 100 rather than 1 because the frequencies are expressed as percentages (multiplied by 100), as described previously.

**Issue: Matrix Sparsity** However, we observe that most quartile values of the frequency attributes are zero: in the columns shown, the first quartile is always 0, and the median and third quartile are 0 more often than not. This is due to the sparsity of the matrix: most words and characters do not occur in most emails. As a result, the data is concentrated at zero, which can distort summary statistics such as the mean and the quartiles. Hence, later in the project, I decided to replace values equal to 0.0 with NaN for the frequency attributes so that they are excluded from these statistics.
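The effect of sparsity, and of the zero-to-NaN replacement, can be illustrated on a toy frame. This is a sketch under stated assumptions: the values are invented, and selecting frequency columns by the `word_freq_` and `char_freq_` prefixes is one possible way to do it, not necessarily the one used later in the project.

```python
import numpy as np
import pandas as pd

# Toy values chosen so that most entries are zero, mimicking the sparsity.
toy = pd.DataFrame({
    "word_freq_free": [0.0, 0.0, 0.0, 2.0, 4.0],
    "char_freq_!": [0.0, 0.0, 0.5, 0.0, 1.5],
    "Class": [0, 0, 1, 1, 1],
})
freq_cols = [c for c in toy.columns if c.startswith(("word_freq_", "char_freq_"))]

# Share of zero entries per frequency column.
zero_share = (toy[freq_cols] == 0).mean()
print(zero_share)

# Zeros become NaN, which pandas skips when computing statistics.
nonzero = toy[freq_cols].replace(0.0, np.nan)
print(toy[freq_cols].median())  # dominated by the zeros
print(nonzero.median())         # describes only emails where the token occurs
```

Note that the replacement changes what the statistics mean: after it, each figure describes only the emails in which the word or character actually appears.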

In [49]:
data.describe()

statistic,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total,Class
count,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,...,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0
mean,0.104553,0.213015,0.280656,0.065425,0.312223,0.095901,0.114208,0.105295,0.090067,0.239413,...,0.038575,0.13903,0.016976,0.269071,0.075811,0.044238,5.191515,52.172789,283.289285,0.394045
std,0.305358,1.290575,0.504143,1.395151,0.672513,0.273824,0.391441,0.401071,0.278616,0.644755,...,0.243471,0.270355,0.109394,0.815672,0.245882,0.429342,31.729449,194.89131,606.347851,0.488698
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.588,6.0,35.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.065,0.0,0.0,0.0,0.0,2.276,15.0,95.0,0.0
75%,0.0,0.0,0.42,0.0,0.38,0.0,0.0,0.0,0.0,0.16,...,0.0,0.188,0.0,0.315,0.052,0.0,3.706,43.0,266.0,1.0
max,4.54,14.28,5.1,42.81,10.0,5.88,7.27,11.11,5.26,18.18,...,4.385,9.752,4.081,32.478,6.003,19.829,1102.5,9989.0,15841.0,1.0
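Since the displayed table hides most columns behind the ellipsis, the range check is easier to trust when done programmatically. The sketch below shows one way to do it on a toy frame with invented values. The prefix-based column selection is an assumption of this sketch.

```python
import pandas as pd

# Toy values; the run-length column is included to show it is excluded from the check.
toy = pd.DataFrame({
    "word_freq_free": [0.0, 1.2, 20.0],
    "char_freq_!": [0.0, 0.7, 30.0],
    "capital_run_length_longest": [1, 61, 9989],
    "Class": [1, 0, 1],
})
freq_cols = [c for c in toy.columns if c.startswith(("word_freq_", "char_freq_"))]

# Keep only the min and max rows of describe() for the frequency columns.
summary = toy[freq_cols].describe().loc[["min", "max"]]

# A frequency column is valid when its whole range lies inside [0, 100].
in_range = (summary.loc["min"] >= 0) & (summary.loc["max"] <= 100)
print(in_range)
print(in_range.all())
```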


For simplicity, we will convert the class label to `bool` and rename it to `spam`. Consequently, a record with `spam=True` is a spam email.

In [52]:
# Guard on 'Class' so that re-running this cell does not raise KeyError: 'Class'.
if 'Class' in data.columns:
    data['spam'] = data['Class'].astype(bool)
    data = data.drop(columns=['Class'])
data['spam']
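Once the label is boolean, it can be used directly as a mask or as a grouping key. The sketch below illustrates this on a toy frame with invented values, applying the same conversion as the cell above.

```python
import pandas as pd

# Toy frame standing in for the full dataset.
toy = pd.DataFrame({
    "char_freq_!": [0.8, 0.4, 0.0, 0.1],
    "Class": [1, 1, 0, 0],
})
toy["spam"] = toy["Class"].astype(bool)
toy = toy.drop(columns=["Class"])

# The mean of a boolean column is the fraction of True values.
spam_share = toy["spam"].mean()

# The column itself works as a row mask.
spam_only = toy[toy["spam"]]

# Grouping by the label gives per-class statistics.
means_by_class = toy.groupby("spam")["char_freq_!"].mean()

print(spam_share)
print(len(spam_only))
print(means_by_class)
```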

## Descriptive statistics <a id="descriptive-statistics"></a>