# Microsoft Security Incident Prediction

This notebook provides a comprehensive workflow for predicting security incidents using a real-world dataset from Microsoft. The process includes:

1. **Data Acquisition**: Downloading the relevant CSV file containing cybersecurity incident data.
2. **Data Preprocessing**: Cleaning and preparing the dataset for analysis, including handling missing values, encoding categorical variables, and scaling numerical features.
3. **Data Exploration**: Exploring the dataset to gain insights into its structure, identifying patterns and correlations, and assessing the distribution of key features.
4. **Visualizations**: Creating various plots to visualize trends and relationships in the data, which helps in understanding the underlying patterns that could inform predictive models.
5. **Model Training**: Training a variety of models, both linear and non-linear, including regression-based approaches and more complex machine learning algorithms such as decision trees and ensemble methods.
6. **Results Analysis**: Evaluating model performance using standard metrics like accuracy, precision, recall, and F1 score, comparing the trade-offs between models, and identifying the most effective approach for predicting the incident grade (the `IncidentGrade` column).

This notebook aims to provide valuable insights for cybersecurity operations by leveraging machine learning techniques to predict and classify incidents based on historical data.
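As a rough orientation, steps 2, 5, and 6 can be sketched end to end with scikit-learn. This is a minimal sketch on synthetic data, not the notebook's actual pipeline: the column names (`category`, `count`, `label`), the choice of logistic regression, and the imputation strategy are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data; these column names are hypothetical, not from the dataset.
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "category": rng.choice(["a", "b", "c"], size=n),
    "count": rng.normal(size=n),
})
toy.loc[rng.choice(n, size=20, replace=False), "count"] = np.nan  # inject missing values
toy["label"] = rng.choice(["low", "high"], size=n)

# Preprocessing: impute and scale numeric features, one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), ["count"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["category"]),
])

# Model training: a linear baseline chained after the preprocessing step.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

X_train, X_test, y_train, y_test = train_test_split(
    toy[["category", "count"]],
    toy["label"],
    test_size=0.25,
    stratify=toy["label"],
    random_state=42,
)
model.fit(X_train, y_train)

# Results analysis: per-class precision, recall, and F1, plus overall accuracy.
predictions = model.predict(X_test)
print(classification_report(y_test, predictions))
```

Because the labels here are random, the reported scores carry no meaning; the point is only the shape of the workflow. Wrapping preprocessing and the classifier in one `Pipeline` keeps the imputer and scaler fitted on the training split only.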


## Downloading the Data from Kaggle

In [1]:
!pip install kagglehub --quiet

In [1]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from kagglehub import dataset_download

def create_reduced_file(original_file_path, reduced_file_path, target_column, sample_size=10000):
    """
    Create a reduced file from the original file using stratified subsampling.

    :param original_file_path: Path to the original file.
    :param reduced_file_path: Path where the reduced file will be saved.
    :param target_column: Name of the target column used for stratification.
    :param sample_size: Number of rows in the reduced dataset.
    :return: The reduced DataFrame.
    """
    df_original = pd.read_csv(original_file_path)
    print(f"The original file has {len(df_original)} rows.")

    # Draw a stratified subsample that preserves the class proportions
    df_reduced, _ = train_test_split(
        df_original,
        train_size=sample_size,
        stratify=df_original[target_column],
        random_state=42
    )

    df_reduced.to_csv(reduced_file_path, index=False)
    print(f"Reduced file created with {len(df_reduced)} rows, preserving class proportions.")
    return df_reduced

# Path configuration
dataset_folder = os.getcwd()
reduced_file_path = os.path.join(dataset_folder, 'microsoft_Reduced.csv')

if not os.path.exists(reduced_file_path):
    file_path = dataset_download("Microsoft/microsoft-security-incident-prediction")
    original_file_path = os.path.join(file_path, "GUIDE_Test.csv")

    # Keep the returned DataFrame so df_reduced is defined on the first run too
    df_reduced = create_reduced_file(original_file_path, reduced_file_path, target_column='IncidentGrade')
else:
    df_reduced = pd.read_csv(reduced_file_path)
    print(f"File loaded with {len(df_reduced)} rows.")

print(df_reduced.head())


NameError: name 'os' is not defined
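
The stratified subsampling performed inside `create_reduced_file` can be checked on a toy DataFrame. The sketch below is illustrative only: the class labels `A`, `B`, `C` and the 60/30/10 split are placeholders, not values taken from the real dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with a 60/30/10 class split; labels are placeholders.
toy = pd.DataFrame({
    "feature": range(1000),
    "IncidentGrade": ["A"] * 600 + ["B"] * 300 + ["C"] * 100,
})

# An integer train_size is read as an absolute row count, as in create_reduced_file.
sample, _ = train_test_split(
    toy,
    train_size=100,
    stratify=toy["IncidentGrade"],
    random_state=42,
)

# Compare class shares before and after subsampling.
original_share = toy["IncidentGrade"].value_counts(normalize=True).sort_index()
sample_share = sample["IncidentGrade"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"original": original_share, "sample": sample_share}))
```

Both columns should show the same class shares, which is what makes the reduced file a reasonable stand-in for the full one. Note that `train_test_split` raises an error if `train_size` is not smaller than the number of rows, so `sample_size` has to stay below the size of the original file.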

In [3]:
pip list | grep kaggle


Note: you may need to restart the kernel to use updated packages.
