</div>

<div style="text-align:center">
    <img src="./src/gabarra/img/gabarra_library_logo.png" alt="Logo" width="250"/>
</div>

<div style="text-align:center">

### **LIBRARY DEVELOPED BY THE 2024 DATA SCIENCE CLASS** 
The bridge, Bilbao, Spain

</div>

### **CONTENT**

<a id="indice"></a>

* [Introduction](#topic-1)
* [Installation and basic usage](#topic-2)
* [Library modules and their built-in functions](#seccion-3)
* [Function descriptions](#seccion-4)

### **Introduction** <a id="topic-1"></a>

Gabarra is a Python library developed by students following a Data Science curriculum. It provides a suite of functions that streamline the workflow of data scientists, organized into modules for Data Analysis, Data Processing, Data Visualization, and Machine Learning.

The development of Gabarra begins with a selection process in which the most efficient functions are chosen for their effectiveness and applicability. The selected functions are then tested to make sure they run reliably, and documented so that users understand what each one does and how to use it.

The official documentation of Gabarra is structured into the following sections:

* Installation and basic usage: The steps required to install the Gabarra library and set it up for use.
* Module Descriptions: The functions of each module, listed for easy recognition.
* Function Descriptions: A description of each function in Gabarra, covering its functionality, parameters, and usage guidelines.

### **Installation and basic usage** <a id="topic-2"></a>

### Installation 
To install the gabarra library, use `pip`, the Python package installer. Open a command prompt or terminal and run the following command:

In [None]:
pip install gabarra

This will download and install the latest version of the gabarra library.

### Usage 
The library offers flexibility in usage, allowing users to either import the entire library or specify individual function modules as needed. The following scripts illustrate these options:

In [None]:
import gabarra as gb
from gabarra import *
import gabarra.data_analysis as gda
import gabarra.data_processing as gdp
import gabarra.data_visualization as gdv
import gabarra.machine_learning as gml

### **LIBRARY MODULES AND THEIR BUILT-IN FUNCTIONS** <a id="seccion-3"></a>
The gabarra library is composed of four different modules: Data Analysis, Data Processing, Data Visualization and Machine Learning.

Each module includes the following functions:


| Data Analysis              | Data Processing            |
|----------------------------|----------------------------|
| `filter_rows()`            | `create_dummies()`         |
| `remove_outliers()`        | `fill_zeros_with_mean()`   |
| `basic_data_analysis()`    | `fill_nans_with_mean()`    |
| `outlier_meanSd()`         | `convert_to_numeric()`     |
| `data_report()`            |                            |
| `missing_values_summary()` |                            |

<div style="page-break-after: always;"></div>

| Data Visualization              | Machine Learning                |
|---------------------------------|---------------------------------|
| `missing_values_summary()`      | `linear_regression()`           |
| `plot_numeric_distributions()`  | `calculate_metrics()`           |
| `plot_pie_charts()`             | `unSupervisedCluster()`         |
| `plot_interactive_line_chart()` | `gradient_boosting_regression()`|
| `plot_interactive_pie_chart()`  | `xgboost_regression()`          |
|                                 | `most_common_words()`           |
|                                 | `y_generator()`                 |
|                                 | `random_forest_regression()`    |


### **FUNCTION DESCRIPTION** <a id="seccion-4"></a>
This section describes each function in detail: its purpose, its parameters, how to use it, and the output it produces. With this information, users can decide how to integrate each function into their own workflows.


### **Data Analysis Module**
Includes functions that help you work more quickly and efficiently on tasks carried out during data analysis.

[Back to content](#indice)

### *filter_rows*
The **`filter_rows()`**  function is designed to filter rows in a Pandas DataFrame based on a specified condition. This function is useful for data analysis tasks where you need to subset data to meet specific criteria.

This function leverages the query method from Pandas to execute the filtering condition. The try block ensures that any exceptions raised during the filtering process are caught and reported, helping users debug potential issues with the condition string.

**Technical Notes**
 * The condition parameter must be a valid string expression that Pandas' query method can parse. This includes standard comparison operators and logical operators.
 * If the condition string is not valid, the exception is caught, an error message is printed, and the function returns `None`.

By providing a flexible and easy-to-use filtering mechanism, `filter_rows()` simplifies the process of subsetting DataFrames based on dynamic conditions.

In [1]:
def filter_rows(df, condition):
    '''
    Filters rows in a DataFrame based on a condition.

    Parameters:
    df(pd.DataFrame): The DataFrame to filter.
    condition (str): The condition to filter the rows. Must be a valid Pandas expression.

    Returns:
    pd.DataFrame: A filtered DataFrame that meets the specified condition.
    '''
    try:
        filtered_df = df.query(condition)
        return filtered_df
    except Exception as e:
        print(f"Error filtering rows: {e}")
        return None
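
The behaviour described above can be tried out with a minimal sketch. The DataFrame and its column names (`city`, `age`) are made up for illustration, and the example calls `DataFrame.query` directly, which is the same call `filter_rows()` makes internally:

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({
    "city": ["Bilbao", "Madrid", "Bilbao", "Sevilla"],
    "age": [25, 41, 37, 19],
})

# A valid condition combines comparison and logical operators in one string.
adults_in_bilbao = df.query("age > 20 and city == 'Bilbao'")
print(adults_in_bilbao)

# An incomplete condition cannot be parsed and raises an exception.
# filter_rows() catches it, prints the message and returns None.
try:
    df.query("age >")
    invalid_raised = False
except Exception as e:
    invalid_raised = True
    print(f"Error filtering rows: {e}")
```

Because `filter_rows()` returns `None` on failure, callers may want to check the result before chaining further DataFrame operations on it.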

### *remove_outliers*
The **`remove_outliers()`** function is designed to eliminate outliers from a specified column in a Pandas DataFrame. This function is particularly useful in data preprocessing, where handling outliers is essential for accurate data analysis and modeling.

This function calculates the first quartile (Q1) and third quartile (Q3) to determine the interquartile range (IQR). It then computes the lower and upper fences to identify outliers. Rows with values outside these fences are excluded from the resulting DataFrame.

In [None]:
import numpy as np

def remove_outliers(df, column_name):
    '''
    Removes outliers from a DataFrame column using the interquartile range (IQR) rule.

    Steps:
        1. Calculate the first quartile (Q1), the value below which the lowest 25% of the data fall.
           np.nanquantile is used so that NaN values in the column are ignored.
        2. Calculate the third quartile (Q3), the value above which the highest 25% of the data fall.
        3. Calculate the lower (lower_fence) and upper (upper_fence) limits. Values below
           lower_fence or above upper_fence are considered outliers.
        4. Return a filtered DataFrame containing only the rows within the calculated limits.
           Rows where the column is NaN are also excluded, because comparisons with NaN are False.
    '''
    Q1 = np.nanquantile(df[column_name], 0.25)
    Q3 = np.nanquantile(df[column_name], 0.75)
    IQ = Q3 - Q1  # interquartile range
    lower_fence = Q1 - 1.5 * IQ
    upper_fence = Q3 + 1.5 * IQ
    return df[(df[column_name] <= upper_fence) & (df[column_name] >= lower_fence)]
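
To make the fence calculation concrete, here is a small worked sketch on made-up numbers. The column name `value` and the data are assumptions chosen so the arithmetic is easy to follow. It repeats the same steps as the function body rather than importing the library:

```python
import numpy as np
import pandas as pd

# Hypothetical data: five typical values, one extreme value and one NaN.
df = pd.DataFrame({"value": [10.0, 12.0, 11.0, 13.0, 12.0, 100.0, np.nan]})

q1 = np.nanquantile(df["value"], 0.25)   # 11.25 (NaN ignored)
q3 = np.nanquantile(df["value"], 0.75)   # 12.75
iqr = q3 - q1                            # 1.5
lower_fence = q1 - 1.5 * iqr             # 9.0
upper_fence = q3 + 1.5 * iqr             # 15.0

# Keep only rows inside the fences. The NaN row is dropped as well,
# because any comparison with NaN evaluates to False.
kept = df[(df["value"] >= lower_fence) & (df["value"] <= upper_fence)]
print(kept)
```

Here the value 100.0 lies above the upper fence of 15.0 and is removed, leaving the five typical values.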

### *basic_data_analysis*

The **`basic_data_analysis()`** function runs a first-pass analysis of a Pandas DataFrame, covering data exploration, cleaning, visualization, and statistical analysis. It is intended for initial data inspection and for understanding the distribution and relationships within the dataset.

The function takes only the DataFrame as an argument. The columns it works on are fixed in the code, so the DataFrame must contain columns named `columna_numerica`, `columna_x`, and `columna_y`.

This function performs the following tasks:
1. Exploration: Displays the first few rows of the DataFrame.
2. Cleaning: Removes rows with null values and casts `columna_numerica` to float.
3. Visualization: Plots a histogram of `columna_numerica`.
4. Statistical Analysis: Provides descriptive statistics and a correlation matrix.
5. Linear Regression: Performs a linear regression of `columna_y` on `columna_x` and prints the slope, intercept, and R².

By using `basic_data_analysis()`, users can quickly gain insights into their dataset, identify potential issues, and understand underlying patterns and relationships.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

def basic_data_analysis(df):
    """
    Performs a basic data analysis of a Pandas DataFrame.

    The DataFrame must contain the columns "columna_numerica", "columna_x" and "columna_y".

    Args: df (pd.DataFrame)
    Returns: None
    """
    # Explore the first records
    print("First rows:")
    print(df.head())

    # Data cleaning
    df = df.dropna()  # Remove rows with null values without modifying the caller's DataFrame
    df = df.astype({"columna_numerica": float})  # Fix data types

    # Visualization
    sns.set(style="whitegrid")
    plt.figure(figsize=(8, 6))
    sns.histplot(df["columna_numerica"], bins=20, kde=True)
    plt.title("Histogram of columna_numerica")
    plt.xlabel("Value")
    plt.ylabel("Frequency")
    plt.show()

    # Statistical analysis
    print("\nDescriptive statistics:")
    print(df.describe())

    # Correlation (numeric columns only; non-numeric columns would raise an error)
    print("\nCorrelation matrix:")
    print(df.corr(numeric_only=True))

    # Linear regression (example)
    slope, intercept, r_value, p_value, std_err = stats.linregress(df["columna_x"], df["columna_y"])
    print(f"\nLinear regression: Slope={slope:.2f}, Intercept={intercept:.2f}, R^2={r_value**2:.2f}")

### *outlier_meanSd*

The **`outlier_meanSd()`** function is designed to remove outliers from a specified feature in a Pandas DataFrame based on the mean and standard deviation method. This function is particularly useful for preprocessing data to ensure that extreme values do not skew the analysis or modeling results.

This function performs the following steps:
1. Calculate Mean and Standard Deviation: Computes the mean (media) and standard deviation (desEst) of the specified feature.
2. Define Thresholds: Calculates the lower (th1) and upper (th2) thresholds using the mean and standard deviation multiplied by the specified parameter (param).
3. Filter Data: Retains rows where the feature values fall within the defined thresholds or are null (NaN).
4. Return Result: Returns a new DataFrame with the filtered values, ensuring that the index is reset.

**Technical Notes**
* The default parameter value of 3 is commonly used to define the range for outlier detection as values beyond three standard deviations from the mean are typically considered outliers.
* The function includes null values in the resulting DataFrame, which can be useful for certain analysis scenarios where missing data should not be removed.

By using outlier_meanSd, users can effectively preprocess their data by removing extreme values, thereby enhancing the reliability and accuracy of subsequent data analysis and modeling tasks.

In [None]:
def outlier_meanSd(df, feature, param=3):

    """
    This function removes outliers from a pandas DataFrame, keeping null values, by:
        1.Calculate the mean (media) and standard deviation (desEst) of the specified feature (feature) in the DataFrame df.
        2.Define two thresholds (th1 and th2) using the mean and standard deviation multiplied by a parameter (param). 
        These thresholds are used to identify outliers.
        3.Filter the original DataFrame (df) based on the following conditions:
            a.Values must fall within the range [th1, th2].
            b.Include any null (NaN) values.
        4.Finally, return a new DataFrame with the filtered values.
    """
    media = df[feature].mean()
    desEst = df[feature].std()

    th1 = media - desEst*param
    th2 = media + desEst*param

    return df[((df[feature] >= th1) & (df[feature] <= th2))  | (df[feature].isnull())].reset_index(drop=True)
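
The threshold logic can be followed on a small made-up dataset. The sketch below repeats the steps of the function on a hypothetical `value` column, so all names and numbers are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 19 typical values, one extreme value and one missing value.
df = pd.DataFrame({"value": [10.0] * 19 + [1000.0, np.nan]})
param = 3

mean = df["value"].mean()   # NaN is skipped by default
std = df["value"].std()     # sample standard deviation (ddof=1)

th1 = mean - std * param    # lower threshold
th2 = mean + std * param    # upper threshold

# Keep rows inside [th1, th2] or with a missing value, then reset the index.
in_range = (df["value"] >= th1) & (df["value"] <= th2)
kept = df[in_range | df["value"].isnull()].reset_index(drop=True)
print(th1, th2, len(kept))
```

Note that the extreme value itself inflates the mean and standard deviation. In very small samples a single outlier can therefore widen the thresholds enough to survive the filter, which is one reason the IQR rule in `remove_outliers()` may be preferable for short columns.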

### *data_report*

The **`data_report()`** function is designed to generate a comprehensive report for a given DataFrame using the pandas library in Python. This function aims to provide a detailed overview of statistics and features pertaining to the input DataFrame, aiding in efficient data exploration and analysis.

The function returns the report as a DataFrame with one column per column of `df`. Its rows give the data type, the percentage of missing values, the number of unique values, and the cardinality percentage of each column.



In [None]:
def data_report(df):

    """
    Parameters:
    df (pandas DataFrame): The input DataFrame for which the report is to be generated.

    This function generates a comprehensive report for a DataFrame using the pandas library.  
    It provides a detailed overview of statistics and features for the input DataFrame. This is 
    useful for efficient data exploration and analysis. 
        1.Column Names (`COL_N`): Creates a DataFrame called `cols` with the column names from 
        the input DataFrame `df`.
        2.Data Types (`DATA_TYPE`): Creates another DataFrame called `types` with the data types 
        of the columns in `df`.
        3.Missing Values (`MISSINGS (%)`): Calculates the percentage of missing values (NaN) in
        each column and creates a DataFrame called `percent_missing_df`.
        4.Unique Values (`UNIQUE_VALUES`): Calculates the number of unique values in each column 
        and creates a DataFrame called `unicos`.
        5.Cardinality (`CARDIN (%)`): Computes the percentage of cardinality (number of unique 
        values relative to the DataFrame size) and creates a DataFrame called `percent_cardin_df`.
        6.Concatenation and Transposition: Combines all the above DataFrames into one called 
        `concatenado`. Then, it transposes this DataFrame so that columns become indices and vice versa.
    """

    import pandas as pd
    # Get the NAMES
    cols = pd.DataFrame(df.columns.values, columns=["COL_N"])

    # Get the TYPES
    types = pd.DataFrame(df.dtypes.values, columns=["DATA_TYPE"])

    # Get the MISSINGS
    percent_missing = round(df.isnull().sum() * 100 / len(df), 2)
    percent_missing_df = pd.DataFrame(percent_missing.values, columns=["MISSINGS (%)"])

    # Get the UNIQUE VALUES
    unicos = pd.DataFrame(df.nunique().values, columns=["UNIQUE_VALUES"])

    percent_cardin = round(unicos['UNIQUE_VALUES']*100/len(df), 2)
    percent_cardin_df = pd.DataFrame(percent_cardin.values, columns=["CARDIN (%)"])

    concatenado = pd.concat([cols, types, percent_missing_df, unicos, percent_cardin_df], axis=1, sort=False)
    concatenado.set_index('COL_N', drop=True, inplace=True)

    return concatenado.T
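
As a rough illustration of two of the report's rows, the sketch below computes the missing-value percentage and the cardinality percentage by hand. The DataFrame and its column names are invented for the example:

```python
import pandas as pd

# Hypothetical data: an identifier, a category with one missing value, a constant flag.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "group": ["a", "a", "b", None],
    "flag": [True, True, True, True],
})

# Percentage of missing values per column.
percent_missing = (df.isnull().sum() * 100 / len(df)).round(2)

# Cardinality: number of unique non-null values relative to the number of rows.
unique_values = df.nunique()
percent_cardin = (unique_values * 100 / len(df)).round(2)

print(percent_missing)
print(percent_cardin)
```

A cardinality of 100% (as for `id`) usually points to an identifier, while a very low value (as for `flag`) points to a constant or near-constant column.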

### *missing_values_summary*

The **`missing_values_summary()`**  function is designed to generate a summary of missing values within a given DataFrame. This function aids in identifying and quantifying the extent of missing data in the DataFrame.

The function returns a summary DataFrame with the count and percentage of missing values for each column of `df`. This information helps assess data quality and informs decisions about data handling strategies, such as imputation or removal of missing values.

In [None]:
def missing_values_summary(df):
    """
    Generates a summary of missing values in the DataFrame.

    Parameters:
    df (DataFrame): The DataFrame to analyze.

    Returns:
    DataFrame: A DataFrame with the count and percentage of missing values per column.
    """
    missing_summary = df.isnull().sum().to_frame(name='Missing Values')
    missing_summary['Percentage'] = (missing_summary['Missing Values'] / len(df)) * 100
    return missing_summary
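
A short sketch of what the summary looks like on made-up data. The column names `a`, `b`, and `c` are assumptions, and the two statements are the same as in the function body:

```python
import numpy as np
import pandas as pd

# Hypothetical data with a different number of missing values per column.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})

# Count of missing values per column, then the share of rows they represent.
missing_summary = df.isnull().sum().to_frame(name="Missing Values")
missing_summary["Percentage"] = (missing_summary["Missing Values"] / len(df)) * 100
print(missing_summary)
```

The result is indexed by column name, so a single entry can be read with `missing_summary.loc["a", "Percentage"]`.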

### **Data Processing Module**
Includes functions that help you work more quickly and efficiently on tasks carried out during data processing.

[Back to content](#indice)

### *create_dummies*

The **`create_dummies()`**  function is designed to generate dummy variables for all object columns within a given DataFrame. This process is essential for preparing categorical data for machine learning models, as many algorithms require numerical inputs. The resulting DataFrame includes the original numeric variables along with the newly created dummy variables.

The function returns the transformed DataFrame, which includes dummy variables for the categorical (object) columns and retains the original numeric columns.

In [None]:
import pandas as pd

def create_dummies(df):
    """
    This function takes a DataFrame and creates dummy variables for all object columns.
    The resulting DataFrame includes the dummy variables along with the original numeric variables.

    Parameters:
    df (pd.DataFrame): The input DataFrame

    Returns:
    pd.DataFrame: The DataFrame with dummy variables and numeric columns
    """
    # Split the columns into categorical (object) and non-object ones
    object_cols = df.select_dtypes(include=['object']).columns
    numeric_df = df.select_dtypes(exclude=['object'])

    # One dummy column per category (no category is dropped)
    dummies_df = pd.get_dummies(df[object_cols], drop_first=False)

    # Non-object columns first, followed by the dummy columns
    final_df = pd.concat([numeric_df, dummies_df], axis=1)

    return final_df
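
The following sketch shows the shape of the output on a tiny made-up DataFrame. The column names (`city`, `size`, `price`) are illustrative assumptions, and the text columns are created with an explicit `object` dtype so that the example matches the dtype the function selects:

```python
import pandas as pd

# Hypothetical data: two text columns and one numeric column.
df = pd.DataFrame({
    "city": pd.Series(["Bilbao", "Madrid", "Bilbao"], dtype="object"),
    "size": pd.Series(["S", "M", "S"], dtype="object"),
    "price": [10.5, 20.0, 12.0],
})

# Same steps as create_dummies(): split, encode, recombine.
object_cols = df.select_dtypes(include=["object"]).columns
numeric_df = df.select_dtypes(exclude=["object"])
dummies_df = pd.get_dummies(df[object_cols], drop_first=False)
final_df = pd.concat([numeric_df, dummies_df], axis=1)

print(list(final_df.columns))
```

Each dummy column is named `<column>_<category>`, and the numeric column comes first because it is the first element passed to `pd.concat`.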

### *fill_zeros_with_mean*

The **`fill_zeros_with_mean()`** function replaces zero values in a specified column of a DataFrame with the column's mean, where the mean is computed over the non-zero values only. This imputation strategy is useful when zeros stand in for missing or invalid measurements in a numeric column.

The function returns the modified DataFrame, in which zero values in the specified column have been replaced with that mean. The input DataFrame is modified in place.

In [None]:
import warnings

def fill_zeros_with_mean(df, column):
    """
    Fills zero values in a specified column of a DataFrame with the column's mean, excluding zeros.

    Args:
        df (pandas.DataFrame): The DataFrame containing the column to be imputed.
        column (str): The name of the column containing zero values.

    Returns:
        pandas.DataFrame: The modified DataFrame with zeros replaced by the mean.

    Raises:
        ValueError: If the specified column does not exist in the DataFrame.

    Warns:
        UserWarning: If there are no non-zero values in the column, a warning is issued
        indicating that the mean cannot be calculated and the column remains unchanged.
    """
    if column not in df.columns:
        raise ValueError(f"Column '{column}' does not exist in the DataFrame.")

    # Non-zero, non-null values are the only ones used to compute the mean
    non_zero = df.loc[df[column] != 0, column].dropna()
    if non_zero.empty:
        warnings.warn(f"Column '{column}' has no non-zero values; the mean cannot be calculated.", UserWarning)
        return df

    df[column] = df[column].replace(0, non_zero.mean())

    return df
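
A quick numeric check of the imputation, using a made-up `sales` column. The sketch applies the same two core steps as the function: compute the mean of the non-zero values, then replace the zeros with it.

```python
import pandas as pd

# Hypothetical data: two zeros that stand in for missing measurements.
df = pd.DataFrame({"sales": [0.0, 10.0, 20.0, 0.0, 30.0]})

# Mean of the non-zero values only: (10 + 20 + 30) / 3 = 20.0
mean_value = df.loc[df["sales"] != 0, "sales"].mean()

# Replace every zero with that mean.
df["sales"] = df["sales"].replace(0, mean_value)
print(df["sales"].tolist())
```

Including the zeros in the mean would give 12.0 instead of 20.0, which is why they are excluded from the calculation.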

### *fill_nans_with_mean*

The **`fill_nans_with_mean()`** function replaces NaN (Not a Number) values in a specified column of a DataFrame with the column's mean, where the mean is computed over the non-missing values only. This imputation strategy addresses missing data in numeric columns without discarding rows.

The function returns the modified DataFrame, in which NaN values in the specified column have been replaced with the mean. The input DataFrame is modified in place.

In [None]:
def fill_nans_with_mean(df, column):
    """
    Fills NaN (Not a Number) values in a specified column of a DataFrame with the column's mean, excluding NaNs.

    Args:
        df (pandas.DataFrame): The DataFrame containing the column to be imputed.
        column (str): The name of the column containing NaN values.

    Returns:
        pandas.DataFrame: The modified DataFrame with NaN values replaced by the mean.

    Raises:
        ValueError: If the specified column does not exist in the DataFrame.

    """
    if column not in df.columns:
        raise ValueError(f"Column '{column}' does not exist in the DataFrame.")

    # skipna=True leaves the NaN values out of the mean
    mean_value = df[column].mean(skipna=True)

    df[column] = df[column].fillna(mean_value)

    return df
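
The same idea in a minimal sketch with a made-up `score` column, showing that the mean is taken over the observed values before the gaps are filled:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value.
df = pd.DataFrame({"score": [1.0, np.nan, 3.0]})

# Mean of the observed values only: (1 + 3) / 2 = 2.0
mean_value = df["score"].mean(skipna=True)

# Fill the gap with that mean.
df["score"] = df["score"].fillna(mean_value)
print(df["score"].tolist())
```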

### *convert_to_numeric*

The **`convert_to_numeric()`** function converts categorical variables in a pandas DataFrame into numerical features. This transformation matters because most machine learning algorithms require numeric inputs. The function supports three encodings, selected with the `motive` parameter: Label Encoding, One-Hot Encoding, and Frequency Encoding.

Parameters:
* motive (str, optional): The type of encoding to apply. Defaults to 'LabelEncoding'.
    * LabelEncoding: Assigns a unique integer to each category.
    * OneHotEncoding: Creates binary features for each category.
    * FrequencyEncoding: Replaces each category with the number of times it appears in the column.
* columns (list, optional): The columns to be converted. Defaults to all categorical columns.

Functionality Overview:
1. Validation of Encoding Motive:
    * The function checks if the provided encoding motive is valid. If not, it raises a ValueError.
2. Selection of Columns:
    * If no specific columns are provided, the function selects all categorical columns using select_dtypes.
3. Encoding Process:
    * For each selected column, the function applies the specified encoding method (LabelEncoding, OneHotEncoding, or FrequencyEncoding).
4. Return DataFrame:
    * The modified DataFrame with converted categorical columns is returned.

Exceptions:
* ValueError: If an invalid encoding motive is provided, a ValueError is raised.

In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

def convert_to_numeric(df, motive='LabelEncoding', columns=None):
    """
    Converts categorical variables in a pandas DataFrame to numerical features.

    Args:
        df (pd.DataFrame): The DataFrame containing the categorical variables.
        motive (str, optional): The type of encoding to apply. Defaults to 'LabelEncoding'.
            - 'LabelEncoding': Assigns a unique integer to each category.
            - 'OneHotEncoding': Creates binary features for each category.
            - 'FrequencyEncoding': Assigns a value based on category frequency.
        columns (list, optional): The columns to be converted. Defaults to all categorical columns.

    Returns:
        pd.DataFrame: The DataFrame with converted categorical columns.

    Raises:
        ValueError: If an invalid 'motive' is provided.
    """

    if motive not in ['LabelEncoding', 'OneHotEncoding', 'FrequencyEncoding']:
        raise ValueError(f"Invalid motive: '{motive}'. Valid options are 'LabelEncoding', 'OneHotEncoding', and 'FrequencyEncoding'.")

    # None (or an empty list) means "all categorical columns"
    if columns is None or len(columns) == 0:
        columns = df.select_dtypes(include=['category', 'object']).columns

    for col in columns:
        if motive == 'LabelEncoding':
            encoder = LabelEncoder()
            df[col] = encoder.fit_transform(df[col])
        elif motive == 'OneHotEncoding':
            # `sparse_output` replaces the older `sparse` argument (scikit-learn 1.2 and later)
            encoder = OneHotEncoder(sparse_output=False)
            encoded_df = pd.DataFrame(
                encoder.fit_transform(df[[col]]),
                columns=[f'{col}_{c}' for c in encoder.categories_[0]],
                index=df.index,  # keep rows aligned when the index is not 0..n-1
            )
            df = pd.concat([df, encoded_df], axis=1).drop(col, axis=1)
        elif motive == 'FrequencyEncoding':
            # Map each category to the number of times it appears
            category_counts = df[col].value_counts().to_dict()
            df[col] = df[col].map(category_counts)

    return df
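
Of the three encodings, frequency encoding is probably the least familiar, so here is a minimal pandas-only sketch of it. The `color` column and the new `color_freq` column are made-up names for illustration. The function itself overwrites the original column instead of adding a new one.

```python
import pandas as pd

# Hypothetical categorical column.
df = pd.DataFrame({"color": ["red", "blue", "red", "green", "red", "blue"]})

# Count how often each category appears: red=3, blue=2, green=1.
category_counts = df["color"].value_counts().to_dict()

# Replace every category by its count (written to a new column here for comparison).
df["color_freq"] = df["color"].map(category_counts)
print(df)
```

One thing to keep in mind with this encoding is that two different categories with the same count become indistinguishable after the transformation.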

### **Data Visualization Module**
Includes functions that help you work more quickly and efficiently on tasks carried out during data visualization.

[Back to content](#indice)

### *missing_values_summary*

The **`missing_values_summary()`** function is designed to generate a summary of missing values within a given DataFrame. This function is crucial for identifying and quantifying the extent of missing data, which is an essential step in data cleaning and preprocessing.

Functionality Overview:
* Counting Missing Values:\
The function computes the count of missing values (NaNs) for each column in the input DataFrame using the isnull().sum() method.
* Calculating Percentage of Missing Values:\
It calculates the percentage of missing values in each column relative to the total number of rows in the DataFrame.
* Creating Summary DataFrame:
    The results are organized into a DataFrame with two columns:
    * Missing Values: The count of missing values for each column.
    * Percentage: The percentage of missing values for each column.

The function returns a summary DataFrame with the count and percentage of missing values for each column of `df`. This information is vital for assessing data quality and determining appropriate data handling strategies, such as imputation or removal of missing values.


In [None]:
def missing_values_summary(df):
    """
    Generates a summary of missing values in the DataFrame.

    Parameters:
    df (DataFrame): The DataFrame to analyze.

    Returns:
    DataFrame: A DataFrame with the count and percentage of missing values per column.
    """
    missing_summary = df.isnull().sum().to_frame(name='Missing Values')
    missing_summary['Percentage'] = (missing_summary['Missing Values'] / len(df)) * 100
    return missing_summary

### *plot_numeric_distributions*

The **`plot_numeric_distributions()`** function is designed to visualize the distributions of all numeric columns in a given DataFrame. It generates histograms and boxplots for each numeric column, providing insights into the data's distribution and potential outliers. Additionally, if a categorical column is specified as hue, the plots will be differentiated based on the categories in this column, enabling comparison across different groups.

For each numeric column, it generates a pair of plots: a histogram and a boxplot.

A histogram is plotted using sns.histplot, optionally differentiated by the hue column if specified. The histogram displays the distribution of values, with density and a kernel density estimate (KDE) overlay.

A boxplot is generated using sns.boxplot, optionally differentiated by the hue column if specified. The boxplot visualizes the distribution, central tendency, and outliers of the column's values.

The function uses plt.show() to display the generated plots.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

def plot_numeric_distributions(df, hue=None):
    """
    Plots histograms and boxplots for all numeric columns in the DataFrame.
    If a categorical column is specified as hue, the plots will be differentiated by this column.

    Parameters:
    df (pd.DataFrame): The DataFrame containing the data.
    hue (str, optional): The name of the categorical column used for differentiation. Default is None.

    Returns:
    None
    """
    # Filter numeric columns
    numeric_columns = df.select_dtypes(include=['float64', 'int64']).columns

    # Iterate over each numeric column
    for col in numeric_columns:
        if col != hue:
            fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14, 6))

            # Histogram
            sns.histplot(data=df, x=col, hue=hue, kde=True, element='step', stat='density', ax=axes[0])
            axes[0].set_title(f'Histogram of {col}' + (f' differentiated by {hue}' if hue else ''))

            # Boxplot
            sns.boxplot(data=df, x=hue, y=col, ax=axes[1])
            axes[1].set_title(f'Boxplot of {col}' + (f' differentiated by {hue}' if hue else ''))

            plt.tight_layout()
            plt.show()
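
One detail of the column filter is worth illustrating: `select_dtypes(include=['float64', 'int64'])` matches only those two exact dtypes. The sketch below, on a made-up DataFrame, compares it with the broader `include="number"` selector. The broader selector is shown only for comparison and is not what the function uses.

```python
import numpy as np
import pandas as pd

# Hypothetical data with three different dtypes.
df = pd.DataFrame({
    "a": np.array([1, 2, 3], dtype="int64"),
    "b": np.array([1.0, 2.0, 3.0], dtype="float32"),
    "c": pd.Series(["x", "y", "z"], dtype="object"),
})

# The filter used by plot_numeric_distributions(): exact dtypes only.
strict = list(df.select_dtypes(include=["float64", "int64"]).columns)

# A broader alternative that matches every numeric dtype.
broad = list(df.select_dtypes(include="number").columns)

print(strict)  # ['a']
print(broad)   # ['a', 'b']
```

Columns stored as `float32` or `int32`, for example after downcasting to save memory, would therefore be skipped by the function as written.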

### *plot_pie_charts*

The **`plot_pie_charts()`** function is designed to create pie charts for specified columns in a pandas DataFrame. This visualization is useful for showing the proportion of categories within each selected column, providing a clear visual representation of categorical data distribution. Each slice is the sum of the `num_orders` column for that category, so the DataFrame must include a `num_orders` column.


In [None]:
import matplotlib.pyplot as plt

def plot_pie_charts(df, columns):
    """
    Creates pie charts for the specified columns in the DataFrame.
    Each chart shows the share of `num_orders` per category.

    :param df: DataFrame containing the data. Must include a `num_orders` column.
    :param columns: List of columns for which pie charts will be created.
    """
    for column in columns:
        # Check that the column exists in the DataFrame
        if column not in df.columns:
            print(f"Column {column} not found in the DataFrame.")
            continue

        # Number of categories in the column
        n_bins = df[column].nunique()

        # Get viridis colors (get_viridis_colors is a helper that is not shown in this section)
        colors = get_viridis_colors(n_bins)

        # Group by the column and sum the orders
        data = df.groupby(column)["num_orders"].sum()

        # Create the pie chart
        plt.figure(figsize=(6, 6))
        plt.pie(data,
                labels=data.index,
                shadow=False,
                colors=colors,
                explode=[0.05] * n_bins,
                startangle=90,
                autopct='%1.1f%%', pctdistance=0.9,
                textprops={'fontsize': 8})
        plt.title(f"% of orders by {column}")
        plt.show()

### *plot_interactive_line_chart*

The **`plot_interactive_line_chart()`** function is designed to create an interactive line chart for specified columns in a pandas DataFrame using Plotly, a powerful graphing library. This visualization allows for interactive exploration of data trends over time or other continuous variables, with optional differentiation by a categorical column.

The function checks if the specified x_column, y_column, and color_column (if provided) exist in the DataFrame. If any column is not found, an appropriate message is printed, and the function exits.

The code will generate and display an interactive line chart based on the specified columns, providing an interactive way to explore data trends. If a color_column is provided, the lines will be color-coded based on the categories in that column.

In [None]:
import plotly.express as px

def plot_interactive_line_chart(df, x_column, y_column, color_column=None):
    """
    Creates an interactive line chart for the specified columns in the DataFrame.

    Parameters:
    df (pd.DataFrame): The DataFrame containing the data.
    x_column (str): The name of the column to be used for the x-axis.
    y_column (str): The name of the column to be used for the y-axis.
    color_column (str, optional): The name of the column used for color differentiation. Default is None.

    Returns:
    None
    """
    # Check if the columns exist in the DataFrame
    if x_column not in df.columns:
        print(f"Column {x_column} not found in the DataFrame.")
        return
    if y_column not in df.columns:
        print(f"Column {y_column} not found in the DataFrame.")
        return
    if color_column and color_column not in df.columns:
        print(f"Column {color_column} not found in the DataFrame.")
        return

    # Create the line chart
    fig = px.line(df, x=x_column, y=y_column, color=color_column, title=f'{y_column} over {x_column}')

    # Update the layout for better appearance
    fig.update_layout(
        xaxis_title=x_column,
        yaxis_title=y_column,
        legend_title=color_column if color_column else 'Legend',
        hovermode='x unified'
    )

    fig.show()

### *plot_interactive_pie_chart*

The **`plot_interactive_pie_chart()`** function generates an interactive pie chart for a specified column in a pandas DataFrame using Plotly. This visualization is useful for displaying the distribution of categories within a specific column, providing a clear and interactive way to understand the proportion of each category.

The function checks if the specified column exists in the DataFrame. If the column is not found, an appropriate message is printed, and the function exits.

The function does not return any value. It directly displays the interactive pie chart.

In [None]:
import plotly.express as px

def plot_interactive_pie_chart(df, column):
    """
    Creates an interactive pie chart for the specified column in the DataFrame.

    Parameters:
    df (pd.DataFrame): The DataFrame containing the data.
    column (str): The name of the column for which the pie chart will be created.

    Returns:
    None
    """
    # Check if the column exists in the DataFrame
    if column not in df.columns:
        print(f"Column {column} not found in the DataFrame.")
        return

    # Aggregate the data
    data = df[column].value_counts().reset_index()
    data.columns = [column, 'counts']

    # Create the pie chart
    fig = px.pie(data, values='counts', names=column, title=f'Distribution of {column}')

    # Update the layout for better appearance
    fig.update_traces(textposition='inside', textinfo='percent+label')
    fig.update_layout(uniformtext_minsize=12, uniformtext_mode='hide')

    fig.show()
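
The aggregation step can be checked on its own before plotting. The following is a minimal sketch, assuming a small made-up DataFrame with a hypothetical `payment` column. It computes the counts the pie chart is built from and the percentage each slice would show.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({"payment": ["card", "card", "cash", "card", "transfer", "cash"]})

# Same aggregation the function performs: one count per category
counts = df["payment"].value_counts()

# Share of each category, i.e. the percentage a pie slice would display
shares = (counts / counts.sum() * 100).round(1)

print(shares.to_dict())
```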

### **Machine Learning Module**
Includes functions that help you train and evaluate machine learning models more quickly and efficiently.

[Back to content](#indice)

### *linear_regression*

The **`linear_regression()`** function performs linear regression on a given pandas DataFrame. It uses the target column specified by the user and calculates performance metrics such as Mean Squared Error (MSE) and R-squared (R²) score. This function is useful for predictive modeling and assessing the relationship between features and the target variable.

The code will perform linear regression on the provided DataFrame, using the specified target column, and return a dictionary with the model, predictions, and evaluation metrics.

In [None]:
def linear_regression(df: pd.DataFrame, target_column: str):
    """
    Performs linear regression on the given dataset without standardizing the data.

    Parameters:
    df (DataFrame): The DataFrame containing the data.
    target_column (str): The name of the target column.

    Returns:
    dict: A dictionary containing the model, predictions, mean squared error, and R^2 score.
    """
    # Splitting the data into features (X) and target (y)
    X = df.drop(columns=[target_column])
    y = df[target_column]

    # Splitting the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Creating and training the linear regression model
    model = LinearRegression()
    model.fit(X_train, y_train)

    # Making predictions
    y_pred = model.predict(X_test)

    # Calculating performance metrics
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    # Returning the results
    return {
        "model": model,
        "predictions": y_pred,
        "mean_squared_error": mse,
        "r2_score": r2
    }
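
As a sanity check of the steps above, the sketch below runs the same split, fit, and scoring sequence directly with scikit-learn on hypothetical noise-free data (`target = 3*x1 - 2*x2 + 5`). The column names and data are assumptions made for illustration. Because the data is exactly linear, the fitted coefficients should recover 3 and -2, and the R² score should be close to 1.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical noise-free data: target = 3*x1 - 2*x2 + 5
x1 = np.arange(50, dtype=float)
x2 = (np.arange(50) ** 2 % 7).astype(float)
df = pd.DataFrame({"x1": x1, "x2": x2, "target": 3 * x1 - 2 * x2 + 5})

# Same steps as the function: split features/target, then train/test
X = df.drop(columns=["target"])
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

print("coefficients:", model.coef_)
print("intercept:", model.intercept_)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R2:", r2_score(y_test, y_pred))
```

All feature columns must be numeric, since every column other than the target is passed to the model as-is.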

### *calculate_metrics*

The **`calculate_metrics()`** function computes and returns a DataFrame containing key regression metrics for evaluating model performance. This function is useful for summarizing the performance of regression models in a structured and easily interpretable format.

The code calculates the MAE, MAPE, MSE, and RMSE for the given true and predicted values, then returns these metrics in a formatted DataFrame for easy interpretation and comparison.

In [None]:
def calculate_metrics(y_test: np.ndarray, predictions: np.ndarray, model_name: str, decimal_places=2):
    """
    Calculates and returns a DataFrame with regression metrics: MAE, MAPE, MSE, and RMSE.

    Parameters:
    y_test (array-like): True values.
    predictions (array-like): Predicted values.
    model_name (str): Name of the model.
    decimal_places (int, optional): Number of decimal places to display. Default is 2.

    Returns:
    pd.DataFrame: DataFrame containing the calculated metrics.
    """
    # Calculate metrics
    mae = metrics.mean_absolute_error(y_test, predictions)
    mape = metrics.mean_absolute_percentage_error(y_test, predictions)
    mse = metrics.mean_squared_error(y_test, predictions)
    rmse = np.sqrt(mse)

    # Create a DataFrame with the metrics
    data = {"Modelo": [model_name], 'MAE': [mae], 'MAPE': [mape], 'MSE': [mse], 'RMSE': [rmse]}
    df_metrics = pd.DataFrame(data)

    # Set pandas' global float display format (affects every DataFrame displayed afterwards)
    pd.options.display.float_format = f'{{:.{decimal_places}f}}'.format

    return df_metrics
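
To make the four metrics concrete, here is a small worked example in plain Python with made-up values. It is a sketch of the formulas, not the library's implementation. Note that scikit-learn's `mean_absolute_percentage_error` returns a fraction (0.0667 means 6.67%), and the hand computation below follows the same convention.

```python
import math

# Hypothetical true and predicted values
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 190.0, 380.0]

errors = [p - t for t, p in zip(y_true, y_pred)]  # [10, -10, -20]
n = len(errors)

mae = sum(abs(e) for e in errors) / n                             # mean absolute error
mape = sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n  # as a fraction
mse = sum(e ** 2 for e in errors) / n                             # mean squared error
rmse = math.sqrt(mse)                                             # root of the MSE

print(f"MAE={mae:.2f} MAPE={mape:.4f} MSE={mse:.2f} RMSE={rmse:.2f}")
# MAE=13.33 MAPE=0.0667 MSE=200.00 RMSE=14.14
```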

### *unSupervisedCluster*

The **`unSupervisedCluster()`** function uses the KMeans algorithm for clustering analysis. Its behavior depends on the value of the `motive` parameter:

1. Analysis (`motive='analisys'`, the default, spelled as in the code): plots the inertia and the silhouette score against the number of clusters (k) to help choose k.
2. Clustering (`motive='clustering'`): assigns each data point to one of `k` clusters and returns the cluster assignments.

In [None]:
def unSupervisedCluster(df: pd.DataFrame, motive='analisys', range=20, k=3):
    
    '''
    Function:
    -----------

    This function works with the unsupervised KMeans model. It shows how the inertia and the silhouette
    score change with the number of clusters, to help you choose k, and it can also fit a KMeans model
    to obtain those clusters.


    Parameters:
    -----------
    df: Pandas DataFrame
        Data that the function is going to analyze
    motive: str
        Selects what the function does: 'analisys' (default) shows two graphs,
        and 'clustering' returns the cluster assigned to each row
    range: int
        Upper bound (exclusive) of the k values, starting at 2, plotted in the inertia and silhouette score graphs
    k: int
        Number of clusters to use when fitting KMeans with motive='clustering'
    Returns:
    -----------
    Pandas DataFrame
        With motive='clustering', a DataFrame with a single 'Cluster' column holding the cluster label of each row.
        With motive='analisys', nothing is returned.

    '''
    if motive=='analisys':
        # The `range` parameter shadows the built-in range(), so build the k values with NumPy instead
        k_values = np.arange(2, range)
        km_list = [KMeans(n_clusters=int(a), random_state=42).fit(df) for a in k_values]
        inertias = [model.inertia_ for model in km_list]
        silhouette_score_list = [silhouette_score(df, model.labels_) for model in km_list]

        plt.figure(figsize=(20,5))

        # Left panel: inertia for each k
        plt.subplot(121)
        sns.set(rc={'figure.figsize':(10,10)})
        plt.plot(k_values, inertias)
        plt.xlabel('k')
        plt.ylabel("inertias")
        sns.despine()

        # Right panel: silhouette score for each k
        plt.subplot(122)
        sns.set(rc={'figure.figsize':(10,10)})
        plt.plot(k_values, silhouette_score_list)
        plt.xlabel('k')
        plt.ylabel("silhouette_score")
        sns.despine()

    if motive =='clustering':
        kmeans = KMeans(n_clusters=k,n_init=10, random_state=42).fit(df)
        df_clusters = pd.DataFrame(kmeans.labels_, columns=['Cluster'])
        return df_clusters
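
The way the two metrics guide the choice of k can be illustrated without plotting. The sketch below is an illustration under stated assumptions: the data is a hypothetical set of three tight, well-separated groups, and the names `inertias`, `silhouettes`, and `best_k` are introduced only for this example. It fits KMeans for several values of k and picks the k with the highest silhouette score.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical data: three tight, well-separated groups of 9 points each
offsets = np.array([(dx, dy) for dx in (-0.5, 0.0, 0.5) for dy in (-0.5, 0.0, 0.5)])
centers = np.array([(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)])
X = np.vstack([center + offsets for center in centers])

inertias = {}
silhouettes = {}
for n_clusters in range(2, 6):
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit(X)
    inertias[n_clusters] = model.inertia_                        # within-cluster sum of squares
    silhouettes[n_clusters] = silhouette_score(X, model.labels_)  # higher is better

# The k with the highest silhouette score is a reasonable first candidate
best_k = max(silhouettes, key=silhouettes.get)
print("best k by silhouette score:", best_k)
```

Inertia tends to keep decreasing as k grows, so it is usually read by looking for an "elbow", while the silhouette score tends to peak near a natural number of groups.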

### *gradient_boosting_regression*

The **`gradient_boosting_regression()`** function applies Gradient Boosting regression to a given dataset. It trains a Gradient Boosting model and computes various performance metrics to evaluate the model's predictive power.

The function returns a dictionary containing the model, predictions, and performance metrics.

In [None]:
def gradient_boosting_regression(df: pd.DataFrame, target_column: str, test_size=0.2, random_state=42, n_estimators=100, learning_rate=0.1):
    """
    Performs Gradient Boosting regression on the given dataset.

    Parameters:
    df (DataFrame): The DataFrame containing the data.
    target_column (str): The name of the target column.
    test_size (float, optional): The proportion of the dataset to include in the test split. Default is 0.2.
    random_state (int, optional): Controls the shuffling applied to the data before applying the split. Default is 42.
    n_estimators (int, optional): The number of boosting stages to be run. Default is 100.
    learning_rate (float, optional): Learning rate shrinks the contribution of each tree by learning_rate. Default is 0.1.

    Returns:
    dict: A dictionary containing the model, predictions, MAE, MAPE, MSE, RMSE, and R^2 score.
    """
    # Splitting the data into features (X) and target (y)
    X = df.drop(columns=[target_column])
    y = df[target_column]

    # Splitting the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)

    # Creating and training the Gradient Boosting model
    model = GradientBoostingRegressor(n_estimators=n_estimators, learning_rate=learning_rate, random_state=random_state)
    model.fit(X_train, y_train)

    # Predictions
    y_pred = model.predict(X_test)

    # Performance metrics
    mae = mean_absolute_error(y_test, y_pred)
    mape = mean_absolute_percentage_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, y_pred)

    # Results
    return {
        "model": model,
        "predictions": y_pred,
        "MAE": mae,
        "MAPE": mape,
        "MSE": mse,
        "RMSE": rmse,
        "R2_score": r2
    }
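
Since the returned dictionary includes the fitted model, the effect of `n_estimators` can be inspected afterwards. The sketch below is a hedged illustration on hypothetical data (`y = x^2`), using scikit-learn's `staged_predict`, which yields the predictions after each boosting stage. The variable `stage_mse` is introduced only for this example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical data: a smooth non-linear target, y = x^2
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = X.ravel() ** 2

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = GradientBoostingRegressor(n_estimators=50, learning_rate=0.1, random_state=42)
model.fit(X_train, y_train)

# Test MSE after each boosting stage (1 tree, 2 trees, ..., 50 trees)
stage_mse = [mean_squared_error(y_test, pred) for pred in model.staged_predict(X_test)]

print(f"MSE after 1 stage:   {stage_mse[0]:.2f}")
print(f"MSE after 50 stages: {stage_mse[-1]:.2f}")
```

A curve of `stage_mse` that is still falling at the last stage suggests that more estimators (or a higher learning rate) may help, while a flat curve suggests that fewer would be enough.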

### *xgboost_regression*

The **`xgboost_regression()`** function applies XGBoost regression to a given dataset. It trains an XGBoost regression model and computes various performance metrics to evaluate the model's predictive power.

The function returns a dictionary containing the model, predictions, and performance metrics.

In [None]:
def xgboost_regression(df: pd.DataFrame, target_column: str, test_size=0.2, random_state=42, n_estimators=100, learning_rate=0.1):
    """
    Performs XGBoost regression on the given dataset.

    Parameters:
    df (DataFrame): The DataFrame containing the data.
    target_column (str): The name of the target column.
    test_size (float, optional): The proportion of the dataset to include in the test split. Default is 0.2.
    random_state (int, optional): Controls the shuffling applied to the data before applying the split. Default is 42.
    n_estimators (int, optional): The number of boosting stages to be run. Default is 100.
    learning_rate (float, optional): Learning rate shrinks the contribution of each tree by learning_rate. Default is 0.1.

    Returns:
    dict: A dictionary containing the model, predictions, MAE, MAPE, MSE, RMSE, and R^2 score.
    """
    # Splitting the data into features (X) and target (y)
    X = df.drop(columns=[target_column])
    y = df[target_column]

    # Splitting the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)

    # Creating and training the XGBoost model
    model = XGBRegressor(n_estimators=n_estimators, learning_rate=learning_rate, random_state=random_state)
    model.fit(X_train, y_train)

    # Predictions
    y_pred = model.predict(X_test)

    # Performance metrics
    mae = mean_absolute_error(y_test, y_pred)
    mape = mean_absolute_percentage_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, y_pred)

    # Results
    return {
        "model": model,
        "predictions": y_pred,
        "MAE": mae,
        "MAPE": mape,
        "MSE": mse,
        "RMSE": rmse,
        "R2_score": r2
    }

### *most_common_words*

The **`most_common_words()`** function extracts the most common words from a list of texts in a specified language while excluding stop words. It utilizes the CountVectorizer from scikit-learn to tokenize the texts and count the occurrences of each word.

* `CountVectorizer` lowercases the texts by default before tokenization, so word counts are case-insensitive.
* The function prints the most common words and does not return them. An option to return them as a dictionary or list would make the function more versatile.

In [None]:
def most_common_words(texts:list, nwords:int, language:str):

    '''
    From a list of texts returns n most common words in a given language excluding stop words.

    Parameters:
    texts (list): List of texts
    nwords (int): number of most common words
    language (str): language of the text

    Return: prints the n most common words with their counts

    Example:

    >>> most_common_words(['I hate cats', 'I love dogs', 'My dog love cats'], 3, 'english')
    Most common words:
    cats: 2
    love: 2
    dog: 1
    '''
    vectorizer_count = CountVectorizer(max_features=nwords, stop_words=language)
    texts = vectorizer_count.fit_transform(texts)
    vocabulary = vectorizer_count.vocabulary_
    most_common_words = {word: texts[:, index].sum() for word, index in vocabulary.items()}
    most_common_words = sorted(most_common_words.items(), key=lambda x: x[1], reverse=True)
    print("Most common words:")
    for word, frequency in most_common_words:
        print(f'{word}: {frequency}')
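
A variant that returns the result instead of printing it could look like the sketch below. The function name `most_common_words_as_list` is hypothetical and is not part of the library. It uses the same `CountVectorizer` configuration and returns `(word, count)` pairs sorted by count, with ties broken alphabetically.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer


def most_common_words_as_list(texts, nwords, language):
    """Hypothetical variant: return (word, count) pairs instead of printing them."""
    vectorizer = CountVectorizer(max_features=nwords, stop_words=language)
    matrix = vectorizer.fit_transform(texts)
    # Total occurrences of each vocabulary word across all texts
    totals = np.asarray(matrix.sum(axis=0)).ravel()
    counts = {word: int(totals[index]) for word, index in vectorizer.vocabulary_.items()}
    # Highest count first; ties broken alphabetically
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


result = most_common_words_as_list(["Red red RED blue", "blue green", "red"], 2, "english")
print(result)  # [('red', 4), ('blue', 2)]
```

The mixed-case input shows the default lowercasing: `Red`, `red`, and `RED` are counted as the same word.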

### *y_generator*

The **`y_generator()`** function categorizes a set of images into different classes based on their names. It maps each image to one or more labels provided in the labels parameter.

Returns a list of arrays representing the labels for each image. Each array is a one-hot encoded representation of the labels.


In [None]:
def y_generator(path, labels, separator):

    '''
    Categorize a set of images for the training of multi-class learning models.
    Based on the image name.
    path[str]: Path to the folder.
    labels[list]: Possible output labels.
    separator[str]: Separator in the image name.
    Returns a list of one-hot encoded arrays, one per label matched in an image name.
    '''

    y = []

    for filename in os.listdir(path):
        # Split the image name into parts; without a separator, use the whole name as a single part
        parts = filename.split(separator) if separator else [filename]
        for part in parts:
            for index, label in enumerate(labels):
                # Case-insensitive match of the label inside this part of the name
                if label.lower() in part.lower():
                    arr = np.zeros(len(labels))
                    arr[index] = 1
                    y.append(arr)
    return y
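
To make the expected output concrete, the sketch below applies the name-matching rule described above to an in-memory list of hypothetical file names instead of a folder. The file names, labels, and the `encoded` variable are assumptions made for illustration.

```python
import numpy as np

labels = ["cat", "dog", "bird"]
# Hypothetical file names, as os.listdir() might return them
filenames = ["cat_001.jpg", "DOG_002.jpg", "bird_003.jpg"]

encoded = []
for name in filenames:
    for part in name.split("_"):
        for index, label in enumerate(labels):
            # Case-insensitive match of the label inside this part of the name
            if label.lower() in part.lower():
                one_hot = np.zeros(len(labels))
                one_hot[index] = 1
                encoded.append(one_hot)

y = np.array(encoded)
print(y)
```

Each row has a single 1 in the position of the matched label, so `cat_001.jpg` becomes `[1, 0, 0]` and `DOG_002.jpg` becomes `[0, 1, 0]`.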

### *random_forest_regression*

The **`random_forest_regression()`** function conducts Random Forest regression analysis on a provided dataset to predict a target variable. 

The `test_size` parameter sets the proportion of the dataset used for testing (default 0.2), `random_state` sets the seed used by the random number generator (default 42), and `n_estimators` sets the number of trees in the Random Forest (default 100). Any of these parameters can be changed to another value.

Returns a dictionary containing the model object, predictions, and various performance metrics.

In [None]:
def random_forest_regression(df, target_column,
                             test_size=0.2,
                             random_state=42,
                             n_estimators=100):
    """
    Performs Random Forest regression on the given dataset.

    Parameters:
    df (DataFrame): The DataFrame containing the data.
    target_column (str): The name of the target column.
    test_size (float, optional): The proportion of the dataset to include in the test split. Default is 0.2.
    random_state (int, optional): Controls the shuffling applied to the data before applying the split. Default is 42.
    n_estimators (int, optional): The number of trees in the forest. Default is 100.

    Returns:
    dict: A dictionary containing the model, predictions, MAE, MAPE, MSE, RMSE, and R^2 score.
    """
    # Splitting the data into features (X) and target (y)
    X = df.drop(columns=[target_column])
    y = df[target_column]

    # Splitting the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)

    # Creating and training the Random Forest model
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=random_state)
    model.fit(X_train, y_train)

    # Predictions
    y_pred = model.predict(X_test)

    # Performance metrics
    mae = mean_absolute_error(y_test, y_pred)
    mape = mean_absolute_percentage_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, y_pred)

    # Results
    return {
        "model": model,
        "predictions": y_pred,
        "MAE": mae,
        "MAPE": mape,
        "MSE": mse,
        "RMSE": rmse,
        "R2_score": r2
    }
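
Because the fitted model is part of the returned dictionary, it can be inspected further, for example through scikit-learn's `feature_importances_` attribute. The sketch below is an illustration on hypothetical data, where the target depends only on a `signal` column and a `noise` column is unrelated. Both column names are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical data: the target depends on "signal" only
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "signal": np.linspace(0, 10, 200),
    "noise": rng.normal(size=200),
})
df["target"] = 5 * df["signal"]

X = df.drop(columns=["target"])
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)

# Importance of each feature; the values sum to 1
importances = dict(zip(X.columns, model.feature_importances_))
print(importances)
```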