# Predict Purity and Price of Honey - Exploratory Data Analysis (EDA)
#### Project Summary:
This project deals with the Exploratory Data Analysis (EDA) process to understand the relationship between the purity and price of honey. The “Predict Purity and Price of Honey” dataset was used to analyze various properties of honey and their effects on purity and price. Inferences were made on various categorical and numerical variables in the dataset and relationships were examined through visualizations and statistical analysis.

#### About Dataset:
The dataset contains attributes that provide information about the purity and price of honey. The attributes include physical parameters such as pH, color, and density. This notebook examines how these parameters relate to the purity and price of honey.

- **CS (Color Score):**
    - Represents the color score of the honey sample, ranging from 1.0 to 10.0. Lower values indicate a lighter color, while higher values indicate a darker color.

- **Density:**
    - Represents the density of the honey sample in grams per cubic centimeter at 25°C, ranging from 1.21 to 1.86.

- **WC (Water Content):**
    - Represents the water content in the honey sample, ranging from 12.0% to 25.0%.

- **pH:**
    - Represents the pH level of the honey sample, ranging from 2.50 to 7.50.

- **EC (Electrical Conductivity):**
    - Represents the electrical conductivity of the honey sample in milliSiemens per centimeter.

- **F (Fructose Level):**
    - Represents the fructose level of the honey sample, ranging from 20 to 50.

- **G (Glucose Level):**
    - Represents the glucose level of the honey sample, ranging from 20 to 45.

- **Pollen_analysis:**
    - Represents the floral source of the honey sample. Possible values include Clover, Wildflower, Orange Blossom, Alfalfa, Acacia, Lavender, Eucalyptus, Buckwheat, Manuka, Sage, Sunflower, Borage, Rosemary, Thyme, Heather, Tupelo, Blueberry, Chestnut, and Avocado.

- **Viscosity:**
    - Represents the viscosity of the honey sample in centipoise, ranging from 1500 to 10000. Viscosity values between 2500 and 9500 are considered optimal for purity.

- **Purity:**
    - The target variable represents the purity of the honey sample, ranging from 0.01 to 1.00.

- **Price:**
    - The calculated price of the honey.


### Objective:
The main objective in this project is to analyze the relationship between the purity levels and the price of honey and determine which factors have more influence on these variables. Furthermore, this EDA process aims to uncover missing data and potential anomalies in the dataset.

#### Studies Conducted:
- #### Data Cleaning and Preprocessing:

    - Missing data checking and cleaning.
    - Organizing and converting categorical data into numerical data.
    - Analysis of outliers in data.
- #### Exploratory Data Analysis (EDA):
    
    - Extraction of basic statistics from the data set (mean, median, standard deviation, etc.).
    - Visualization of categorical data (pH levels, purity groups, etc.).
    - Correlation analysis of numerical data and examination of their distributions.
    - Comparison of purity distribution according to different pH groups.
- #### Visualizations:

    - Visualization of distributions and relationships of variables using graphs such as box plots, histograms and scatter plots.
    - Visual analysis of categorical variables using pie charts and bar charts.
- #### Statistical Tests:

    - ANOVA test to compare purity averages across pH groups.
    - Analyzing the effects of different pH ranges on the purity and price of honey.
- #### Results:
    - A significant relationship was observed between pH value and purity.
    - When the purity averages of honey in different pH groups were compared, it was found that some pH ranges had higher purity levels than others.
    - There was a linear relationship between price and purity, with higher purity honey generally being more expensive.
- #### Conclusions:
    This EDA process provides valuable insights into understanding the relationship of honey purity and price with certain characteristics. In particular, the effects of pH and other physical properties on honey purity and price provide important decision support tools for honey producers and consumers.

### Next Steps:
This analysis can serve as a basis for further modeling studies (e.g., linear regression or machine learning algorithms) and predictive analyses to estimate honey purity. Furthermore, more sophisticated feature engineering and modeling methods could be applied for honey price prediction.



<div style="width: auto; height: auto; overflow: hidden; text-align: center;">
  <img src="https://png.pngtree.com/thumb_back/fw800/background/20240716/pngtree-on-a-white-background-honeycomb-with-honey-drops-image_16002385.jpg" alt="Melting Honey" style="width: 100%; height: 100%; object-fit: cover;">
</div>


# Import Libraries

In [None]:
# Basic Libraries
import pandas as pd # Data Manipulation
import numpy as np # Mathematical Operations


# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Model persistence and evaluation metrics
import joblib
from sklearn.metrics import r2_score

import random
import time

import xgboost as xgb
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
# filter
import warnings
warnings.filterwarnings("ignore")



In [None]:
def set_global_random_seed(seed: int = 42):
    """Sets a global random seed for reproducibility across various libraries."""
    # Set Python's built-in random module
    random.seed(seed)
    
    # Set NumPy random seed
    np.random.seed(seed)

# Set the seed
set_global_random_seed(seed=7)


# Load and Read Dataset

In [None]:
data = pd.read_csv("/kaggle/input/predict-purity-and-price-of-honey/honey_purity_dataset.csv")

# First Look at the Dataset

In [None]:
def add_random_missing_values(dataframe: pd.DataFrame,
                              missing_rate: float = 0.05,
                              random_state: int = 42,
                              exclude_columns: list = None) -> pd.DataFrame:
    """Turns random values to NaN in a DataFrame with reproducibility, excluding specified columns.

    Args:
        dataframe (pd.DataFrame): DataFrame to be processed.
        missing_rate (float): Percentage of missing value rate in float format. Defaults 0.05.
        random_state (int): Seed for random number generator. Defaults 42.
        exclude_columns (list): List of column names to exclude from missing value assignment.

    Returns:
        df_missing (pd.DataFrame): Processed DataFrame object.

    """
    # Set random seed for reproducibility
    np.random.seed(random_state)

    # Get copy of dataframe
    df_missing = dataframe.copy()

    # Determine columns to include
    if exclude_columns:
        columns_to_include = [col for col in dataframe.columns if col not in exclude_columns]
    else:
        columns_to_include = dataframe.columns.tolist()

    # Ensure we have at least one column to modify
    if not columns_to_include:
        raise ValueError("All columns are excluded. No missing values can be added.")

    # Obtain size of dataframe and number of total missing values
    df_size = len(columns_to_include) * dataframe.shape[0]
    num_missing = int(df_size * missing_rate)

    # Map included column indices
    included_col_indices = [dataframe.columns.get_loc(col) for col in columns_to_include]

    # Generate random row and column indexes
    row_indices = np.random.randint(0, dataframe.shape[0], num_missing)
    col_indices = np.random.choice(included_col_indices, num_missing)

    # Turn selected values into NaN
    for row_idx, col_idx in zip(row_indices, col_indices):
        df_missing.iat[row_idx, col_idx] = np.nan

    return df_missing


# Add NaN values to 0.5% of the cells in the DataFrame (missing_rate=0.005)
df = add_random_missing_values(data, missing_rate=0.005,random_state=42,
                               exclude_columns=["Pollen_analysis","Price"])
df.head() # Gets the first 5 rows of the data set


In [None]:
df.tail() # Gets the last 5 rows of the data set

In [None]:
df.sample(10) # Gets 10 random rows of the data set

* When we first look at the data set, we see that all columns except **Pollen_analysis** contain numeric values.
* We see that there are missing values in the data set.

In [None]:
print("Dataset shape: ", df.shape) # (rows, columns)

In [None]:
df.info() # Summary about the data set

### Let's look at the missing values!

The **isna()** method checks whether the objects of a Data Frame or Series contain missing or null values (NA, NaN) and returns a new object with the same shape as the original, but whose elements have boolean values True or False. True indicates the presence of null or missing values, False indicates otherwise.

In [None]:
# Create a DataFrame showing the missing data
missing = pd.DataFrame(df.isna().sum()).rename(columns={0:"Miss_values"}) 
# Create the missing-percentage column (share of rows, in percent)
missing["Miss_Percent"] = missing["Miss_values"] / len(df) * 100

missing.head() 

In [None]:
# DataFrame with only missing values
only_miss_val = missing[missing["Miss_values"]> 0]
only_miss_val

In [None]:
plt.figure(figsize=(8,4))
sns.barplot(data = only_miss_val , y=only_miss_val.index, x="Miss_values")
plt.title("Missing Values with Barplot")
plt.xlabel("Number of Missing Values")
plt.tight_layout()
plt.show()

### Are there any duplicate records?

The **duplicated()** method returns a Series with True and False values that define which rows in the DataFrame are duplicated and which are not. Use the subset parameter to specify which columns to include when searching for duplicates. By default all columns are included.
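The effect of the `subset` parameter can be seen on a small example. This is a minimal sketch on hypothetical toy values (not rows from the honey dataset); the names `toy`, `full_duplicates`, and `subset_duplicates` are illustrative only.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the honey dataset)
toy = pd.DataFrame({
    "Pollen_analysis": ["Clover", "Clover", "Acacia", "Clover"],
    "pH": [3.5, 3.5, 4.1, 3.9],
})

# By default all columns are compared: only row 1 repeats row 0
full_duplicates = toy.duplicated().sum()

# With subset, only the listed columns are compared:
# rows 1 and 3 repeat the "Clover" value first seen in row 0
subset_duplicates = toy.duplicated(subset=["Pollen_analysis"]).sum()

print(full_duplicates, subset_duplicates)
```

For this dataset the default (all columns) is the sensible choice, since repeated pollen types or pH values alone do not make two samples identical.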

In [None]:
# Number of duplicate records
print("Number of duplicate records in the dataset: ",df.duplicated().sum()) 

### Number of Unique Values for Each Feature

The **nunique()** method returns the number of distinct values in each column. Missing values (NaN) are not counted by default; pass `dropna=False` to count NaN as a separate value.
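Because this notebook deliberately injects NaN values, it may help to see how pandas' `nunique()` counts distinct values when some entries are missing. The sketch below uses hypothetical toy values; `toy`, `distinct_counts`, and `distinct_with_nan` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one missing pH entry
toy = pd.DataFrame({
    "pH": [3.5, 3.5, np.nan, 4.1],
    "Pollen_analysis": ["Clover", "Acacia", "Clover", "Clover"],
})

# NaN is ignored by default
distinct_counts = toy.nunique()

# NaN is counted as its own value when dropna=False
distinct_with_nan = toy.nunique(dropna=False)

print(distinct_counts)
print(distinct_with_nan)
```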

In [None]:
def unique_values(df:pd.DataFrame) -> None:
    """
    Calculates and prints the number of unique values in each column of a pandas DataFrame.

    Parameters:
    -----------
    df : pd.DataFrame
        The DataFrame for which the unique values will be calculated.

    Returns:
    --------
    None
        This function does not return anything; it prints the number of unique values 
        for each column.

    """
    for col in df.columns:
        number_of_unique = df[col].nunique()
        print(f"Number of unique values in column {col}: {number_of_unique}")
        print("=="*22)
unique_values(df)

In [None]:
# Summary Function
def get_summary(df: pd.DataFrame) -> pd.DataFrame:
    """
    Generates a summary of the given DataFrame, including descriptive statistics and additional information.

    This function returns a summary DataFrame with the following columns for each feature:
    - Count: Non-missing entries in each column.
    - unique: Number of unique values in each column.
    - missing: Number of missing values in each column.
    - duplicated: Total number of duplicate rows in the DataFrame.
    - mean: Mean of the column (for numerical columns only).
    - std: Standard deviation of the column (for numerical columns only).
    - min: Minimum value in the column (for numerical columns only).
    - 25%: 25th percentile of the column (for numerical columns only).
    - 50%: 50th percentile (median) of the column (for numerical columns only).
    - 75%: 75th percentile of the column (for numerical columns only).
    - max: Maximum value in the column (for numerical columns only).
    
    Args:
        df (pd.DataFrame): The input DataFrame to be summarized.

    Returns:
        pd.DataFrame: A styled DataFrame with descriptive statistics and additional information,
                      including background color gradients and borders for visual emphasis.
    """
    df_desc = pd.DataFrame(df.describe(include="all").T)
    df_summary = pd.DataFrame({"dtype": df.dtypes,
                               "Count": df.count(),
                               "unique": df.nunique(),
                               "missing": df.isna().sum(),
                               "duplicated": df.duplicated().sum(),
                               "mean": df_desc["mean"].values,
                               "min": df_desc["min"].values,
                               "std": df_desc["std"].values,
                               "25%": df_desc["25%"].values,
                               "50%": df_desc["50%"].values,
                               "75%": df_desc["75%"].values,
                               "max": df_desc["max"].values})

    return df_summary.style\
        .background_gradient(cmap='YlGnBu', subset=['mean', 'std', 'min', '25%', '50%', '75%', 'max'])\
        .set_properties(**{'border': '1.5px solid black'})


In [None]:
get_summary(df) # Summary

## **Dataset Description Based on First Step:**
Below is a detailed description of each column in the dataset:

1. **CS**
   - **Type**: Numerical
   - **Description**: Represents the color score of the honey sample. Lower values indicate a lighter color, while higher values indicate a darker color.
   - **Summary Statistics**:
     - **Mean**: 5.49
     - **Standard Deviation**: 2.59
     - **Minimum**: 1.00
     - **25th Percentile**: 3.26
     - **Median**: 5.50
     - **75th Percentile**: 7.74
     - **Maximum**: 10
2. **Density**
   - **Type**: Numerical
   - **Description**: Represents the density of the honey sample in grams per cubic centimeter at 25°C.
   - **Summary Statistics**:
     - **Mean**: 1.53
     - **Standard Deviation**: 0.18
     - **Minimum**: 1.21
     - **25th Percentile**: 1.37
     - **Median**: 1.54
     - **75th Percentile**: 1.70
     - **Maximum**: 1.86
3. **WC**
   - **Type**: Numerical
   - **Description**: Represents the water content in the honey sample.
   - **Summary Statistics**:
     - **Mean**: 18.50
     - **Standard Deviation**: 3.74
     - **Minimum**: 12
     - **25th Percentile**: 15.26
     - **Median**: 18.51
     - **75th Percentile**: 21.75
     - **Maximum**: 25
4. **pH**
   - **Type**: Numerical
   - **Description**: Represents the pH level of the honey sample.
   - **Summary Statistics**:
     - **Mean**: 4.99
     - **Standard Deviation**: 1.44
     - **Minimum**: 2.5
     - **25th Percentile**: 3.74
     - **Median**: 4.99
     - **75th Percentile**: 6.25
     - **Maximum**: 7.50 
5. **EC**
   - **Type**: Numerical
   - **Description**: Represents the electrical conductivity of the honey sample in milliSiemens per centimeter.
   - **Summary Statistics**:
     - **Mean**: 0.79
     - **Standard Deviation**: 0.05
     - **Minimum**: 0.70
     - **25th Percentile**: 0.75
     - **Median**: 0.80
     - **75th Percentile**: 0.85
     - **Maximum**: 0.90
6. **F**
   - **Type**: Numerical
   - **Description**: Represents the fructose level of the honey sample.
   - **Summary Statistics**:
     - **Mean**: 34.97
     - **Standard Deviation**: 8.65
     - **Minimum**: 20.00
     - **25th Percentile**: 27.46
     - **Median**: 34.97
     - **75th Percentile**: 44.47
     - **Maximum**: 50.00   
7. **G**
   - **Type**: Numerical
   - **Description**: Represents the glucose level of the honey sample.
   - **Summary Statistics**:
     - **Mean**: 32.50
     - **Standard Deviation**: 7.22
     - **Minimum**: 20.00
     - **25th Percentile**: 26.23
     - **Median**: 32.50
     - **75th Percentile**: 38.76
     - **Maximum**: 45.00
8. **Pollen_analysis**
   - **Type**: Categorical
   - **Description**: Represents the floral source of the honey sample.

9. **Viscosity**
   - **Type**: Numerical
   - **Description**: Represents the viscosity of the honey sample in centipoise. Viscosity values between 2500 and 9500 are considered optimal for purity.
   - **Summary Statistics**:
     - **Mean**: 5752.32
     - **Standard Deviation**: 2455.72
     - **Minimum**: 1500.05
     - **25th Percentile**: 3627.17
     - **Median**: 5752.66
     - **75th Percentile**: 7886.04
     - **Maximum**: 9999.97
10. **Purity**
   - **Type**: Numerical
   - **Description**: The target variable; represents the purity of the honey sample.
   - **Summary Statistics**:
     - **Mean**: 0.82
     - **Standard Deviation**: 0.13
     - **Minimum**: 0.61
     - **25th Percentile**: 0.66
     - **Median**: 0.82
     - **75th Percentile**: 0.97
     - **Maximum**: 1.00
11. **Price**
   - **Type**: Numerical
   - **Description**: The calculated price of the honey.
   - **Summary Statistics**:
     - **Mean**: 594.80
     - **Standard Deviation**: 233.62
     - **Minimum**: 128.72
     - **25th Percentile**: 433.00
     - **Median**: 612.96
     - **75th Percentile**: 770.22
     - **Maximum**: 976.69



# Exploratory Data Analysis (EDA) 

### Separation of Categorical and Numerical variables

In [None]:
# Numeric features (select_dtypes is independent of the platform's default integer width)
numerical_columns = df.select_dtypes(include="number").columns.tolist()
# Categorical features
categorical_columns = df.select_dtypes(include=["object", "category"]).columns.tolist()

print(f"Number of columns containing numeric values: {len(numerical_columns)}")
print(f"Columns containing numeric values: {numerical_columns}")
print()
print(f"Number of columns containing categorical values: {len(categorical_columns)}")
print(f"Columns containing categorical values: {categorical_columns}")

# Descriptive Analysis

## Target Distribution
First, let's look at the distribution of the target variable, **Price**. For this, we will use a **histogram**.

We will also use a **KDE plot** to see the overall shape of the distribution.

In [None]:
# Create a figure
fig = plt.figure(figsize=(12, 6))

# Add axes for the KDE plot
kde_axes = fig.add_axes([0.1, 0.1, 0.35, 0.8])  # [left, bottom, width, height]
sns.kdeplot(x=df["Price"], color='blue', ax=kde_axes)
kde_axes.set_title('KDE Plot', fontsize=14)
kde_axes.set_xlabel('Price', fontsize=12)

# Add axes for the histogram (distplot is deprecated, so histplot is used instead)
hist_axes = fig.add_axes([0.55, 0.1, 0.35, 0.8])  # [left, bottom, width, height]
sns.histplot(
    x=df["Price"],
    color='#967bb6', ax=hist_axes,
)
hist_axes.set_title('Distribution', fontsize=14)
hist_axes.set_xlabel('Price', fontsize=12)
hist_axes.set_ylabel('Frequency', fontsize=12)

# Show the plot
plt.show()

1. **Implications of the Price Histogram:**
- **Price Distribution:**

    - The histogram shows how the honey samples are distributed across price ranges, and therefore which price ranges are most common in the dataset.
    - For example, a large peak around 600 on the chart may indicate that honey is most commonly priced within that range.

- **Segmentation:**

    - Multiple peaks (modes) may indicate that honey products are divided into price segments:
        - Cheap (for example, low-quality or additive-containing products),
        - Mid-priced (more common quality products),
        - Premium (high-quality or natural honey).

## Correlation 
A **correlation matrix** is a table used to evaluate the pairwise linear relationships between the variables in a data set. Each cell contains a correlation coefficient between two variables: **1** indicates a perfect positive linear relationship, **0** indicates no linear relationship, and **-1** indicates a perfect negative linear relationship. The closer the absolute value is to 1, the stronger the relationship.
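To make the meaning of the coefficient concrete, here is a minimal sketch on hypothetical toy columns built to have correlations of exactly 1, -1, and 0 with `x`. The column names are illustrative only and do not come from the honey dataset.

```python
import numpy as np
import pandas as pd

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Toy columns (hypothetical) with known linear relationships to x
toy = pd.DataFrame({
    "x": x,
    "rises_with_x": 2 * x + 1,                  # perfectly increasing with x
    "falls_with_x": -3 * x + 10,                # perfectly decreasing with x
    "unrelated": [1.0, -1.0, 0.0, -1.0, 1.0],   # symmetric values, no linear trend
})

# Pearson correlation of every column with x
corr_with_x = toy.corr()["x"]
print(corr_with_x.round(3))
```

Note that a coefficient near 0 only rules out a linear relationship; the `unrelated` column above still follows a clear (symmetric) pattern in `x`.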

In [None]:
# Shows the degree to which the Price value is linearly related to other features
df.corr(numeric_only=True)["Price"].sort_values()

In [None]:
# Correlation matrix
corr_matrix = df.corr(numeric_only=True)

# Create a mask for the upper triangle
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))

# Create heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(
    corr_matrix, 
    mask=mask, 
    annot=True, # Show correlation values
    fmt=".3f", # Value format
    cmap='coolwarm', # Color scale
    vmin=-1, vmax=1, # Correlation range
    linewidths=0.5, # Cell separator width
    cbar_kws={'shrink': 0.8} # Color bar size
    )

plt.title('Correlation Matrix (Upper Triangle Hidden)', fontsize=16)
plt.tight_layout()
plt.show()


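* **Pollen_analysis** appears to be strongly associated with **Price**. Because it is a categorical variable, it is excluded from the numeric correlation matrix above, so this relationship is examined in the next section by coloring the scatter plot by pollen type.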


### Purity and Price

In [None]:
plt.figure(figsize=(12,8))
sns.scatterplot(data=df, x="Purity", y="Price", hue="Pollen_analysis", palette="tab10")
plt.title("Scatter by Purity & Price") 
plt.xlabel("Purity") 
plt.ylabel("Price") 
# Place the legend at a custom position (axes-fraction coordinates of its lower-left corner)
plt.legend(loc=[0.23, 0.295], title="Pollen Analysis")

plt.show()

In [None]:
group_df = df.copy()
# Segmentation according to purity values
group_df['Purity_group'] = pd.cut(group_df['Purity'], bins=[0.6, 0.8, 0.9, 1.0], labels=['Low', 'Medium', 'High'])

# Average price for each purity group
purity_group_means = group_df.groupby('Purity_group')['Price'].mean()

print(purity_group_means)


In [None]:
# Visualize Price averages with bar plot
purity_group_means.plot(kind='bar')
plt.title('Mean Price by Purity Group')
plt.ylabel('Mean Price')
plt.xlabel('Purity Group')
plt.xticks(rotation=0)
plt.show()


In [None]:
# Counting by purity groups
purity_group_counts = group_df['Purity_group'].value_counts()

# Create Pie plot
plt.figure(figsize=(6, 4))
plt.pie(purity_group_counts, labels=purity_group_counts.index, autopct='%1.1f%%', startangle=90, colors=['#ff9999','#66b3ff','#99ff99'])
plt.title('Distribution of Samples by Purity Group')
plt.axis('equal') 
plt.show()


In [None]:
plt.figure(figsize=(6, 4))
plt.pie(purity_group_means, labels=purity_group_means.index, autopct='%1.1f%%', startangle=90, colors=['#ff9999','#66b3ff','#99ff99'])
plt.title('Share of Mean Price by Purity Group')
plt.axis('equal') 
plt.show()

#### Insights from the Correlation Analysis and the Target Variable Distribution:

#### 1. Honey Purity and Price Distribution
* **Multimodal Distribution:** Multiple peaks (modes) in the graph can indicate different groups in the honey purity or price:
   - This could suggest honey groups with varying purity levels (e.g., 100% natural, adulterated, low quality).
   - It might also reflect different price ranges (low-priced, mid-priced, premium).
     - For example: The pronounced peak around 600 could indicate that a common type of honey is sold in this price range.

#### 2. Density and Segments
   The presence of peak points in different regions suggests clear segmentation in honey purity and price:
   - **Low-Priced and Low Purity:** If concentration is observed in lower ranges, it may represent honey groups with added substances.
   - **High-Priced and High Purity:** Peaks in higher price ranges might indicate premium and natural honey products.

#### 3. Distribution Range
   - The prices spread across a broad range (from roughly 130 to 980, according to the summary statistics). This implies significant variability in honey prices and purity levels.
   - Possible explanations:
     - Different honey types (e.g., flower honey, chestnut honey, multifloral honey) or production methods (organic, conventional) could contribute to this variability.
     - Regional differences (e.g., imported honey vs. local honey) might cause such diversity.

#### 4. Anomalies
   - **Concentration Ranges:** The clear concentration around 600 suggests that honey in this range is widespread in terms of price/purity. This could indicate that the data collection process focused on a specific segment of the market.
   - **Possible Outliers:** Irregularities in the KDE curve may point to anomalies or unexpected behaviors.
     - For example: Honey with very low purity but high prices may represent cases where the price doesn’t reflect the quality.
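One simple way to look for the "low purity but high price" cases mentioned above is to combine quantile thresholds on both columns. This is only a sketch on hypothetical toy values; the quartile cut-offs (0.25 and 0.75) are an assumption chosen for illustration, not thresholds used elsewhere in this notebook.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the honey dataset)
toy = pd.DataFrame({
    "Purity": [0.62, 0.65, 0.80, 0.95, 0.99, 0.63],
    "Price": [150.0, 200.0, 600.0, 900.0, 950.0, 920.0],
})

# Rows in the lowest quarter of purity
low_purity = toy["Purity"] <= toy["Purity"].quantile(0.25)

# Rows in the highest quarter of price
high_price = toy["Price"] >= toy["Price"].quantile(0.75)

# Samples whose price does not seem to reflect their purity
suspicious = toy[low_purity & high_price]
print(suspicious)
```

On the real data the same two masks could be built from `df["Purity"]` and `df["Price"]`; rows with NaN in either column simply evaluate to `False` and are left out.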


### pH and Purity

In [None]:
plt.figure(figsize=(12,8))
sns.scatterplot(data=df, x="Purity", y="pH")
plt.title("Scatter by Purity & pH") 
plt.xlabel("Purity") 
plt.ylabel("pH") 

plt.show()

**Possible Research Questions:**

- Why are purity levels clustered at regular pH values?
- Is there a direct impact of pH on purity?
- Are higher purity levels consistent with specific pH values?

In [None]:
# Segmentation according to pH values
group_df['pH_group'] = pd.cut(group_df['pH'], bins=[2, 4, 6, 8], labels=['Low', 'Medium', 'Normal'])

# Average purity for each group
group_means = group_df.groupby('pH_group')['Purity'].mean()
print(group_means)


In [None]:
group_means.plot(kind='bar')
plt.title('Mean Purity by pH Group')
plt.ylabel('Mean Purity')
plt.xlabel('pH Group')
plt.xticks(rotation=0)
plt.show()


In [None]:
# Counting by pH groups
ph_group_counts = group_df['pH_group'].value_counts()

plt.figure(figsize=(6, 4))
plt.pie(ph_group_counts, labels=ph_group_counts.index, autopct='%1.1f%%', startangle=90, colors=['#ff9999','#66b3ff','#99ff99'])
plt.title('Distribution of Samples by pH Group')
plt.axis('equal')  
plt.show()


In [None]:
plt.figure(figsize=(6, 4))
plt.pie(group_means, labels=group_means.index, autopct='%1.1f%%', startangle=90, colors=['#ff9999','#66b3ff','#99ff99'])
plt.title('Share of Mean Purity by pH Group')
plt.axis('equal')  
plt.show()

Since p_value < 0.05, there is a statistically significant difference in purity between the groups with different pH values.
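The cell that computes `p_value` is not shown in this section. A minimal sketch of how a one-way ANOVA across pH groups could be run is given below, assuming SciPy is installed and using synthetic purity values in place of the real data. The group means, the seed, and the names `toy` and `samples` are all hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

rng = np.random.default_rng(0)

# Synthetic stand-in for group_df (hypothetical group means, 50 samples per group)
toy = pd.DataFrame({
    "pH_group": np.repeat(["Low", "Medium", "Normal"], 50),
    "Purity": np.concatenate([
        rng.normal(0.70, 0.05, 50),
        rng.normal(0.85, 0.05, 50),
        rng.normal(0.75, 0.05, 50),
    ]),
})

# One array of purity values per pH group; NaN values are dropped first
samples = [grp["Purity"].dropna().to_numpy() for _, grp in toy.groupby("pH_group")]

# One-way ANOVA: the null hypothesis is that all group means are equal
f_stat, p_value = f_oneway(*samples)
print(f"F = {f_stat:.2f}, p = {p_value:.4g}")
```

ANOVA only indicates that at least one group mean differs; it does not say which groups differ, so a post-hoc comparison would be needed for that.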

### Viscosity

In [None]:
# Purity-group counts for samples inside the optimal viscosity range (2500 to 9500)
vis_df = group_df.loc[group_df["Viscosity"].between(2500, 9500), "Purity_group"].value_counts()

In [None]:
plt.figure(figsize=(7,4))
vis_df.plot(kind="bar",color="skyblue")
plt.xlabel("Purity Group")
plt.ylabel("Number of Samples")
plt.title("Samples in the Optimal Viscosity Range by Purity Group")
plt.show()

#### Inference:

According to the dataset description, the optimal viscosity range is between 2500 and 9500. However, upon closer inspection, we observe records with high viscosity values but low purity values.

#### Possible Causes and Comments:

- **Additives or Processing Effects:**

    - Additives (such as sugar syrup, starch, or gelatin) might have been added to increase the viscosity of the honey. This could reduce the purity of the honey.
    - Honey that has undergone industrial processing, such as concentration, could increase its viscosity but lose its natural structure and purity.

-  **Low Moisture Content:**

      - The honey may have low moisture content. While low moisture can increase viscosity, it doesn't necessarily guarantee high purity. If purity is low, low moisture could just be a physical characteristic.

-  **Non-Floral Properties:**

      - Non-natural feeding conditions (e.g., feeding bees sugar syrup) or adulterated honey production can lead to high viscosity but low purity.

-  **Carbohydrate Composition Changes:**

      - Alterations in the glucose and fructose ratio in the honey, in unnatural proportions, might increase viscosity while decreasing purity.

-   **Crystallization (Granulation):**

      - The increase in viscosity might be due to crystallization. However, crystallization is a natural process. If purity is low, crystallization might be due to additives rather than natural contents.


# Handling Missing Values and Outliers

Since the **Mean** and **Median** values are very close to each other, we can use either one.

### Purity

In [None]:
fill_df = df.copy()
# Calculates the average purity values for each pollen type
purity_mean_df = df.groupby(["Pollen_analysis"])["Purity"].mean().reset_index()

# Filling in missing values with the average of purity values for each pollen type
for row in zip(purity_mean_df["Pollen_analysis"], purity_mean_df["Purity"]):
    fill_df.loc[((fill_df["Purity"].isna()) & (fill_df["Pollen_analysis"] == row[0])), "Purity"] = row[1]
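The same group-mean imputation can also be written without an explicit loop by using pandas' `groupby(...).transform("mean")` together with `fillna`. The sketch below is an alternative shown on hypothetical toy values, not the approach used in this notebook; `toy`, `group_mean`, and `Purity_filled` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one missing purity per pollen type
toy = pd.DataFrame({
    "Pollen_analysis": ["Clover", "Clover", "Clover", "Acacia", "Acacia"],
    "Purity": [0.80, np.nan, 0.90, 0.70, np.nan],
})

# Mean purity of each row's pollen type, aligned to the original index
group_mean = toy.groupby("Pollen_analysis")["Purity"].transform("mean")

# Replace only the missing entries with their group mean
toy["Purity_filled"] = toy["Purity"].fillna(group_mean)
print(toy)
```

Because `transform` returns a Series aligned with the original rows, the fill needs no row-by-row matching on the group key.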

In [None]:
fill_group_df = fill_df.copy()
fill_group_df['Purity_group'] = pd.cut(fill_group_df['Purity'], bins=[0.6, 0.8, 0.9, 1.0], labels=['Low', 'Medium', 'High'])

### pH

In [None]:
# Calculates the average pH values for each pollen type and purity group
ph_mean_df = fill_group_df.groupby(["Pollen_analysis","Purity_group"])["pH"].mean().reset_index()


# Filling in missing values with the average of pH values for each pollen type and purity group
for row in zip(ph_mean_df["Pollen_analysis"], ph_mean_df["Purity_group"],ph_mean_df["pH"]):
    fill_group_df.loc[((fill_group_df["pH"].isna()) & (fill_group_df["Pollen_analysis"] == row[0]) & 
                       (fill_group_df["Purity_group"]==row[1])), "pH"] = row[2]
    

In [None]:
fill_group_df['pH_group'] = pd.cut(fill_group_df['pH'], bins=[2, 4, 6, 8], labels=['Low', 'Medium', 'Normal'])

### Viscosity

In [None]:
vis_mean_df = fill_group_df.groupby(["Pollen_analysis","Purity_group"])["Viscosity"].mean().reset_index()

for row in zip(vis_mean_df["Pollen_analysis"], vis_mean_df["Purity_group"],vis_mean_df["Viscosity"]):
    fill_group_df.loc[((fill_group_df["Viscosity"].isna()) & (fill_group_df["Pollen_analysis"] == row[0]) & 
                       (fill_group_df["Purity_group"]==row[1])), "Viscosity"] = row[2]

In [None]:
def fill_value(df,group_df):
    """
    Fills missing values in the target dataframe (df) based on a reference dataframe (group_df).

    This function checks for missing values in the specified columns of the target dataframe 
    (df) and fills those missing values with corresponding values from the group dataframe 
    (group_df) based on certain column conditions.

    The function supports two scenarios:
    1. When `group_df` has 3 columns: it uses the first two columns to match the values and 
       fills the missing values in the third column.
    2. When `group_df` has 2 columns: it uses the first column to match the values and 
       fills the missing values in the second column.

    Args:
    - df (pandas.DataFrame): The target dataframe containing missing values that need to be filled.
    - group_df (pandas.DataFrame): The reference dataframe used to fill missing values in `df`.

    Returns:
    - pandas.DataFrame: A new dataframe (a copy of `df`) with the missing values filled.
    
    Example:
    ```python
    df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [None, 5, None, 7]})
    group_df = pd.DataFrame({'A': [1, 3], 'B': [10, 30]})
    filled_df = fill_value(df, group_df)
    ```
    In this example, missing values in column 'B' of `df` will be filled based on matching 
    values in column 'A' from `group_df`.
    """
    # Work on a copy so the caller's DataFrame is left untouched
    new_df = df.copy()

    columns_list = group_df.columns.to_list()
    if len(columns_list) == 3:
        key_1, key_2, target = columns_list
        for val_1, val_2, fill in zip(group_df[key_1], group_df[key_2], group_df[target]):
            mask = new_df[target].isna() & (new_df[key_1] == val_1) & (new_df[key_2] == val_2)
            new_df.loc[mask, target] = fill
    elif len(columns_list) == 2:
        key, target = columns_list
        for val, fill in zip(group_df[key], group_df[target]):
            mask = new_df[target].isna() & (new_df[key] == val)
            new_df.loc[mask, target] = fill

    return new_df

In [None]:
fill_group_df.isna().sum()

In [None]:
group_keys = ["Pollen_analysis", "Purity_group"]

# Fill each remaining numeric column with the mean of its (pollen type, purity group)
for col in ["Density", "CS", "WC", "EC", "F", "G"]:
    col_mean_df = fill_group_df.groupby(group_keys)[col].mean().reset_index()
    # fill_value returns a filled copy, so the result must be assigned back
    fill_group_df = fill_value(fill_group_df, col_mean_df)

## Use Frequency Table and Mode for Categorical Variables


In [None]:
# List comprehension for categorical features
categorical_columns = [col for col in fill_group_df.columns if fill_group_df[col].dtypes in ["object","category"]]

# Calculate percentages for categorical columns
category_percentages = {}

for col in categorical_columns:
    category_percentages[col] = fill_group_df[col].value_counts(normalize=True) * 100

for col, percentages in category_percentages.items():
    print(f"Percentages for {col}:")
    for category, percent in percentages.items():
        print(f"  {category}: {percent:.2f}%")
    print("\n")

In [None]:
for col in categorical_columns:
    frequency_table = fill_group_df[col].value_counts().reset_index()
    
    # Rename the columns for better readability
    frequency_table.columns = [col, 'Frequency']
    
    # Set the category names as the index
    frequency_table.set_index(col, inplace=True)
    
    # Calculate the percentage for each category
    frequency_table['Percentage'] = (frequency_table['Frequency'] / len(fill_group_df)) * 100
    
    # Print the frequency table
    print(frequency_table)
    print("\n")
    
    # Calculate the mode of the column
    mode_value = fill_group_df[col].mode()[0]
    
    # Print the mode of the column
    print(f"Mode for {col}: {mode_value}\n")
    print("=="*40)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

for col in categorical_columns:
    frequency_table = fill_group_df[col].value_counts().reset_index()
    
    # Rename the columns for better readability
    frequency_table.columns = [col, 'Frequency']
    
    # Set the category names as the index
    frequency_table.set_index(col, inplace=True)
    
    # Calculate the percentage for each category
    frequency_table['Percentage'] = (frequency_table['Frequency'] / len(fill_group_df)) * 100
    
    # Print the frequency table
    print(frequency_table)
    print("\n")
    
    # Calculate the mode of the column
    mode_value = fill_group_df[col].mode()[0]
    
    # Print the mode of the column
    print(f"Mode for {col}: {mode_value}\n")
    print("=="*40)

    # Create a barplot to visualize the frequency of each category
    plt.figure(figsize=(8, 6))
    sns.barplot(x=frequency_table.index, y=frequency_table['Frequency'], palette='viridis')
    
    # Title and labels
    plt.title(f"Category Frequency Distribution for {col}", fontsize=14)
    plt.xlabel(col, fontsize=12)
    plt.ylabel("Frequency", fontsize=12)
    
    # Rotate x-axis labels for readability (if needed)
    plt.xticks(rotation=78)
    
    # Display the plot
    plt.tight_layout()
    plt.show()


## Metrics for Numerical Features Based on Robustness to Outliers

Understanding how different metrics handle outliers is crucial in data analysis. Below are the basic metrics for numerical features, categorized by their robustness to outliers.

### Outliers

Outliers are data points that are significantly different from other observations in the dataset. They can be a result of variability in the data or indicate measurement error. Identifying and handling outliers is crucial because they can greatly impact statistical analyses.

- **Robust to Outliers**                  
    - Median
    - Interquartile Range (IQR)
    - Median Absolute Deviation (MAD)
    - Trimmed Mean
    
- **Sensitive to Outliers**                    
    - Mean
    - Standard Deviation
    - Variance



In [None]:
# Select numeric feature columns (independent of platform-specific int/float widths)
numerical_columns = fill_group_df.select_dtypes(include="number").columns.tolist()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import scipy.stats as stats

def plot_metrics(column_data, column_name, TARGET = "Price"):
    # Creating subplots for different visualizations
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))

    # Histogram
    sns.histplot(column_data, kde=False, ax=axes[0, 0], color='skyblue')
    axes[0, 0].set_title(f"Histogram of {column_name}")
    axes[0, 0].set_xlabel(column_name)
    axes[0, 0].set_ylabel("Frequency")

    # Boxplot
    sns.boxplot(y=column_data, ax=axes[0, 1], color='lightgreen')
    axes[0, 1].set_title(f"Boxplot of {column_name}")
    axes[0, 1].set_xlabel(column_name)

    # Scatter plot of the feature against the target variable
    sns.scatterplot(x=column_data, y=fill_group_df[TARGET], ax=axes[1, 0], color='orange')
    axes[1, 0].set_title(f"Scatter plot of {column_name} vs {TARGET}")
    axes[1, 0].set_xlabel(column_name)
    axes[1, 0].set_ylabel(TARGET)

    # Q-Q plot
    stats.probplot(column_data, dist="norm", plot=axes[1, 1])
    axes[1, 1].set_title(f"Q-Q plot of {column_name}")

    # Adjust layout
    plt.tight_layout()
    plt.show()

# Assuming numerical_columns and fill_group_df are already defined
for col in numerical_columns:
    col_data = fill_group_df[col]
    
    # Plot the graphs for each column
    plot_metrics(col_data, col)
    print("#"*100)

In [None]:
def drop_outlier(df: pd.DataFrame, columns_list: list) -> pd.DataFrame:
    """
    Removes rows containing outliers in specified columns from the DataFrame.

    Parameters:
    df (pd.DataFrame): Input DataFrame.
    columns_list (list): List of column names to check for outliers.

    Returns:
    pd.DataFrame: DataFrame with outliers removed.
    """
    data = df.copy()
    print("Old data shape: ",data.shape) 
    
    for col in columns_list:
        q1 = data[col].quantile(0.25)
        q3 = data[col].quantile(0.75)
        IQR = q3 - q1
        fence_low = q1 - (1.5 * IQR)
        fence_high = q3 + (1.5 * IQR)
        
        # Filter rows where column values are within the fence
        data = data[(data[col] >= fence_low) & (data[col] <= fence_high)]

    print("New data shape: ",data.shape) 
    
    return data


In [None]:
no_outlier_data = drop_outlier(fill_group_df,numerical_columns)
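
As a quick sanity check of the 1.5 × IQR fence rule used in `drop_outlier`, the sketch below applies the same steps to a single made-up column. The frame `toy` and its values are hypothetical and only serve to show which rows fall outside the fences.

```python
import pandas as pd

# Hypothetical toy values; 95 lies far from the rest
toy = pd.DataFrame({"value": [10, 12, 13, 14, 15, 16, 18, 95]})

q1 = toy["value"].quantile(0.25)
q3 = toy["value"].quantile(0.75)
iqr = q3 - q1
fence_low = q1 - 1.5 * iqr
fence_high = q3 + 1.5 * iqr

# Keep only rows whose value lies within the fences
inside = toy[(toy["value"] >= fence_low) & (toy["value"] <= fence_high)]

print(f"fences: [{fence_low}, {fence_high}]")
print(f"rows kept: {len(inside)} of {len(toy)}")
```

Because `drop_outlier` filters column by column, the quartiles of later columns are computed on rows that survived the earlier filters, so the column order can influence the result.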

# Feature Engineering

## Feature Screening

Feature screening is a crucial step in the data quality process that involves identifying and removing features (variables) that do not contribute meaningful information to the analysis or modeling. By eliminating such features, we can streamline the dataset, improve model performance, and enhance interpretability. In this section, we will discuss three specific criteria for feature screening:

**Features with Coefficient of Variation Less than 0.1 for Continuous Variables**

The coefficient of variation (CV) is a measure of relative variability. It is calculated as the ratio of the standard deviation to the mean. Features with a CV of less than 0.1 are considered to have low variability and may not provide important information for analysis. We will identify and remove such features.

**Features with Mode Category Percentage Greater than 95% for Categorical Variables**

Categorical variables dominated by a single category (mode category percentage > 95%) may not be useful for analysis because they do not provide much variation. We will identify and remove these categorical features to streamline the dataset.

**Features with a Percentage of Unique Categories Exceeding 90% for Categorical Variables**

Categorical variables with a high percentage of unique categories (>90%) can complicate analysis and lead to overfitting of models. We will identify and remove these features to provide a more robust and generalizable model.

By applying these elimination criteria, we can ensure that the features remaining in the dataset provide meaningful and relevant information for subsequent analyses.
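
The three criteria can be illustrated on a small made-up frame before applying them to the honey data. In the sketch below, the dataframe `toy` and its column names are hypothetical; the thresholds (0.1, 95%, 90%) are the ones described above.

```python
import pandas as pd

# Hypothetical toy data, not taken from the honey dataset
toy = pd.DataFrame({
    "stable_num": [100.0, 101.0, 99.0, 100.0, 100.0, 101.0, 99.0, 100.0, 100.0, 100.0],
    "varied_num": [1.0, 5.0, 9.0, 2.0, 7.0, 3.0, 8.0, 4.0, 6.0, 10.0],
    "dominant_cat": ["A"] * 10,
    "id_like_cat": [f"id_{i}" for i in range(10)],
    "useful_cat": ["x", "y", "z", "x", "y", "z", "x", "y", "z", "x"],
})

num_cols = ["stable_num", "varied_num"]
cat_cols = ["dominant_cat", "id_like_cat", "useful_cat"]

# Criterion 1: coefficient of variation (std / mean) below 0.1
cv = toy[num_cols].std() / toy[num_cols].mean()
low_cv = cv[cv < 0.1].index.tolist()

# Criterion 2: share of the most frequent category above 95%
mode_pct = toy[cat_cols].apply(lambda s: s.value_counts(normalize=True).max() * 100)
high_mode = mode_pct[mode_pct > 95].index.tolist()

# Criterion 3: percentage of unique categories above 90%
unique_pct = toy[cat_cols].nunique() / len(toy) * 100
high_unique = unique_pct[unique_pct > 90].index.tolist()

print("Low CV:", low_cv)
print("Dominant mode:", high_mode)
print("Mostly unique:", high_unique)
```

Note that the CV is only meaningful for features whose mean is not close to zero; a feature centered near zero can produce a very large or unstable ratio.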

In [None]:
def feature_screening(df: pd.DataFrame,
                      target: str, 
                      categorical_columns: list, 
                      numerical_columns: list) -> pd.DataFrame:
    """
    Performs feature screening on a dataset by analyzing numerical and categorical features 
    to identify and remove less informative or redundant features.

    Parameters:
    -----------
    df : pd.DataFrame
        The dataset containing features and the target variable.

    target : str
        The name of the target variable column.

    categorical_columns : list
        A list of column names corresponding to categorical features.

    numerical_columns : list
        A list of column names corresponding to numerical features.

    Returns:
    --------
    pd.DataFrame
        A cleaned dataset with the identified features removed.

    Methodology:
    ------------
    - For numerical features:
      - Calculates the Coefficient of Variation (CV) and identifies features with CV < 0.1.

    - For categorical features:
      - Identifies features where the most frequent category (mode) exceeds 95% of the data.
      - Identifies features where the percentage of unique categories exceeds 90%.

    - Combines all identified features to be removed and drops them from the dataset.

    Example:
    --------
    >>> import pandas as pd
    >>> data = {
    ...     'num1': [1, 1, 1, 1],
    ...     'num2': [2, 2, 2, 2],
    ...     'cat1': ['A', 'A', 'A', 'A'],
    ...     'cat2': ['B', 'C', 'D', 'E'],
    ...     'target': [0, 1, 0, 1]
    ... }
    >>> df = pd.DataFrame(data)
    >>> cleaned_data = feature_screening(
    ...     df=df, 
    ...     target='target', 
    ...     categorical_columns=['cat1', 'cat2'], 
    ...     numerical_columns=['num1', 'num2']
    ... )
    >>> print(cleaned_data)
    """
    # Separate the dataset into input variables (predictors) and target variable (response)
    label = df[target]
    inputs = df.drop(columns=[target])
    
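    # Calculate Coefficient of Variation for continuous variables.
    # Keep only numeric columns present in `inputs`, so the target is skipped
    # regardless of its position in `numerical_columns`.
    num_cols = [col for col in numerical_columns if col in inputs.columns]
    cv = inputs[num_cols].std() / inputs[num_cols].mean()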
    
    # Identify features with CV less than 0.1
    low_cv_features = cv[cv < 0.1].index.tolist()
    print("Features with Coefficient of Variation less than 0.1:", low_cv_features)
    
    # Calculate Mode Category Percentage for categorical variables
    mode_percentage = inputs[categorical_columns].apply(lambda x: x.value_counts(normalize=True).max() * 100)
    
    # Identify features where the mode category percentage is greater than 95%
    high_mode_features = mode_percentage[mode_percentage > 95].index.tolist()
    print("Categorical features where mode category percentage is greater than 95%:", high_mode_features)
    
    # Calculate Percentage of Unique Categories for categorical variables
    unique_category_percentage = inputs[categorical_columns].nunique() / len(inputs) * 100
    
    # Identify features with a percentage of unique categories exceeding 90%
    high_unique_features = unique_category_percentage[unique_category_percentage > 90].index.tolist()
    print("Categorical features with percentage of unique categories exceeding 90%:", high_unique_features)
    
    # Combine all features to be removed
    features_to_remove = set(low_cv_features + high_mode_features + high_unique_features)
    print("Features to be removed:", features_to_remove)
    
    # Remove the identified features from the inputs dataframe
    cleaned_inputs = inputs.drop(columns=features_to_remove)
    
    # Combine the cleaned inputs with the label
    cleaned_dataset = pd.concat([cleaned_inputs, label], axis=1)
    
    # Display the cleaned dataset
    return cleaned_dataset


In [None]:
cleaned_dataset = feature_screening(no_outlier_data,target = "Price",
                      numerical_columns = numerical_columns,
                      categorical_columns=categorical_columns)

In [None]:
numerical_columns.remove("EC")

## Feature Extraction

In [None]:
def feature_extraction(df):
    # G (Glucose Level) / Purity:
    # Measures the proportion of glucose relative to the purity. Higher values may indicate adulteration or lower quality.
    df["pur_G_ratio"] = df["G"] / df["Purity"]
    
    # F (Fructose Level) / Purity:
    # Measures the proportion of fructose relative to purity. Natural honey tends to have more fructose than glucose.
    df["pur_F_ratio"] = df["F"] / df["Purity"]
    
    # WC (Water Content) / Purity:
    # Indicates the amount of water in relation to the purity. Excess water may suggest poor storage or added content.
    df["pur_WC_ratio"] = df["WC"] / df["Purity"]
    
    # (F + G) / Purity:
    # Represents the total sugar (fructose + glucose) content normalized by purity. Very high values may indicate sugar adulteration.
    df["pur_FG_ratio"] = (df["F"] + df["G"]) / df["Purity"]
    
    # CS (Color Score) / Purity:
    # Color can reflect the floral source and processing. This ratio helps explore how color relates to purity.
    df["pur_CS_ratio"] = df["CS"] / df["Purity"]
    
    # (Density * Viscosity) / Purity:
    # Combines two physical characteristics to analyze their joint effect on purity. Dense and viscous honey is usually higher quality.
    df["den_Vis_ratio"] = (df["Density"] * df["Viscosity"]) / df["Purity"]
    
    # (F² + G²) / Viscosity:
    # A non-linear combination of sugar content compared to viscosity. Helps assess how sugar concentration affects thickness.
    df["vis_FG_ratio"] = ((df["F"]**2) + (df["G"]**2)) / df["Viscosity"]

    return df


In [None]:
new_df = feature_extraction(cleaned_dataset)
new_df.head()

# Data Visualization

## Correlation

In [None]:
# Compute the correlation matrix
corr_matrix = new_df.corr(numeric_only=True)

# Build a mask for the upper triangle
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))

# Draw the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(
    corr_matrix, 
    mask=mask, 
    annot=True,  # Show the correlation values
    fmt=".3f",  # Value format
    cmap='coolwarm',  # Color scale
    vmin=-1, vmax=1,  # Correlation range
    linewidths=0.5,  # Width of the lines separating cells
    cbar_kws={'shrink': 0.8}  # Color bar size
)

plt.title('Correlation Matrix (Upper Triangle Masked)', fontsize=16)
plt.tight_layout()
plt.show()


In [None]:
# Calculate the correlation of every numeric feature with the target variable ('Price')
correlations = new_df.corr(numeric_only=True)['Price'].drop('Price')

# Bar chart
sns.barplot(x=correlations.values, y=correlations.index, palette='viridis')
plt.title("Correlation with Target Variable")
plt.xlabel("Correlation Coefficient")
plt.ylabel("Features")
plt.tight_layout()
plt.show()


## Histogram

In [None]:
# Select numeric feature columns (including the newly extracted ratios)
numerical_columns = new_df.select_dtypes(include="number").columns.tolist()

In [None]:
fig,ax = plt.subplots(4,4,figsize=(15,12))
axes = ax.flatten()
for i,col in enumerate(numerical_columns):
    sns.histplot(data=new_df, x=col, stat="frequency", ax=axes[i])


# Remove unused axes
for j in range(len(numerical_columns), len(axes)):
    fig.delaxes(axes[j])
    
plt.tight_layout()
plt.show()

* The **EC** and **Purity** columns contain discrete data.
* A **cat_purity** column (low_purity, normal, high_purity) can be derived from the **Purity** column.
* A **cat_ec** column (low_ec, normal, high_ec) can be derived from the **EC** column.

## Scatterplot

### Distribution of numerical variables according to the Price variable

In [None]:
fig,ax = plt.subplots(4,4,figsize=(15,12))
axes = ax.flatten()

for i,col in enumerate(numerical_columns):
    sns.scatterplot(data=new_df , x=col, y="Price" ,ax=axes[i])


# REMOVE unused axes
for j in range(len(numerical_columns), len(axes)):
    fig.delaxes(axes[j])
    
plt.tight_layout()
plt.show()

### Distribution of Pollen_analysis variable according to Price variable

In [None]:
sns.scatterplot(data=new_df , x="Pollen_analysis", y="Price",color="skyblue")

plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

**Category Differences:**

* There are differences in price ranges between categories. For example:
    - The "Manuka" category seems to be more concentrated in the higher price range compared to other categories.
    - Prices in the "Clover" and "Thyme" categories are generally concentrated in the middle range.
    - Prices in categories such as "Acacia" and "Tupelo" are observed in a narrower range.

**Price Concentration:**

- In many categories, price values are concentrated between 400 and 600 units.
- However, some categories also contain lower or higher price values.
- This may mean that some honey categories are sold at higher prices, but the scatter plot alone is not enough to conclude this.

In [None]:
pollen_price = new_df.groupby("Pollen_analysis")["Price"].mean().sort_values(ascending=False)
sns.barplot(y=pollen_price.index, x=pollen_price.values)
plt.title("Average Sales Prices by Pollen Types")
plt.tight_layout()
plt.show()

# Converting Categorical Variables to Numerical Format

- We will apply label encoding to the **Purity_group** and **pH_group** columns because their categories have a natural order (for example: Low < Medium < High).
- We will apply one-hot encoding to the **Pollen_analysis** column.
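
A minimal sketch of both encodings on a made-up frame is given below. The frame `toy` is hypothetical, and the Low/Medium/High ordering is an assumption about how the purity groups are labeled.

```python
import pandas as pd

# Hypothetical toy data using the same column names as the notebook
toy = pd.DataFrame({
    "Purity_group": ["Low", "High", "Medium", "Low"],
    "Pollen_analysis": ["Clover", "Manuka", "Thyme", "Clover"],
})

# Label (ordinal) encoding: an explicit mapping preserves the category order
purity_order = {"Low": 0, "Medium": 1, "High": 2}
toy["Purity_group_encoded"] = toy["Purity_group"].map(purity_order)

# One-hot encoding: one indicator column per category; drop_first removes
# the first category (alphabetically) to avoid a redundant column
dummies = pd.get_dummies(toy["Pollen_analysis"], drop_first=True, dtype="int")
encoded = pd.concat([toy.drop(columns="Pollen_analysis"), dummies], axis=1)

print(encoded)
```

An explicit mapping is used instead of an automatic encoder so that the integer codes follow the intended order rather than the alphabetical one.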

## Label Encoding

In [None]:
# new_df["Purity_group"] = new_df["Purity_group"].replace({"Low":0,
#                              "Medium":1,
#                              "High":2})

# new_df["pH_group"] = new_df["pH_group"].replace({"Low":0,
#                              "Medium":1,
#                              "Normal":2})

## One Hot Encoding

We will do one hot encoding using the **pd.get_dummies()** method.

In [None]:
# dummies_df = pd.get_dummies(new_df["Pollen_analysis"],drop_first=True,dtype="int") 

In [None]:
# model_data = pd.concat([new_df,dummies_df],axis=1).drop(columns="Pollen_analysis")
# print("Model dataset shape: ",model_data.shape)
# model_data.head()


# Separation of Feature and Label variables

In [None]:
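# The encoding cells above are commented out, so `model_data` is not defined.
# Use `new_df` directly; categorical columns are handled natively by XGBoost below.
X = new_df.drop(columns="Price")
y = new_df["Price"]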

In [None]:
cats = X.select_dtypes(include="object").columns.tolist()
X[cats] = X[cats].astype("category")

# Create Model

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state=2025)

In [None]:
X_train.shape , X_test.shape

In [None]:
# import optuna
# import xgboost as xgb
# from sklearn.model_selection import KFold
# from sklearn.metrics import mean_squared_error
# import numpy as np

# # Define the objective function for hyperparameter optimization
# def objective(trial):
#     # Hyperparameters to be tuned using Optuna
#     param = {
#         'objective': 'reg:squarederror',
#         'eval_metric': 'rmse',
#         'colsample_bytree': trial.suggest_float('colsample_bytree', 0.3, 0.9),  # Proportion of features used for trees
#         'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.5),  # Learning rate
#         'max_depth': trial.suggest_int('max_depth', 3, 10),  # Depth of the tree
#         'alpha': trial.suggest_float('alpha', 0.01, 1),  # L1 regularization term
#         'n_estimators': trial.suggest_int('n_estimators', 100, 600),  # Number of trees
#         'subsample': trial.suggest_float('subsample', 0.6, 1.0),  # Proportion of data used for training
#     }
    
#     # Set up KFold for k-fold cross-validation (e.g., 5 folds)
#     kf = KFold(n_splits=5, shuffle=True, random_state=42)
    
#     # List to store the performance (RMSE) for each fold
#     cv_rmse_scores = []
    
#     # KFold loop
#     for train_index, val_index in kf.split(X_train):
#         # Split data into training and validation sets
#         train_x, valid_x = X_train.iloc[train_index], X_train.iloc[val_index]
#         train_y, valid_y = y_train.iloc[train_index], y_train.iloc[val_index]
        
#         # Create XGBoost model
#         model = xgb.XGBRegressor(**param,enable_categorical=True)

#         # Train the model on the training data
#         model.fit(train_x, train_y)

#         # Make predictions on the validation set
#         y_pred = model.predict(valid_x)

#         # Calculate RMSE and append it to the list
#         rmse = np.sqrt(mean_squared_error(valid_y, y_pred))  # square root of MSE gives RMSE
#         cv_rmse_scores.append(rmse)
    
#     # Return the average RMSE from all folds
#     mean_rmse = np.mean(cv_rmse_scores)
#     return mean_rmse  # Optuna will try to minimize this value

# # Start the Optuna optimization
# study = optuna.create_study(direction='minimize')  # Goal is to minimize RMSE
# study.optimize(objective, n_trials=50)  # 50 trials will be performed

# # Print the best parameters and the lowest RMSE found
# print("Best parameters: ", study.best_params)
# print("Lowest RMSE: ", study.best_value)

In [None]:
# The imports in the Optuna cell above are commented out, so import what this cell needs
import xgboost as xgb
from sklearn.metrics import mean_squared_error

# Train the final model with the best parameters
best_params = {'colsample_bytree': 0.8841336691493017, 'learning_rate':
               0.4951441846210324, 'max_depth': 5, 'alpha': 0.8754677850368304, 'n_estimators': 598, 'subsample': 0.8728936635344561}
best_model = xgb.XGBRegressor(**best_params, enable_categorical=True)

# Fit the model on the full training split
best_model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = best_model.predict(X_test)

# Calculate the Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE) on the test set: {mse}")

In [None]:
# Let's visualize Actual vs Predicted values
plt.scatter(y_test, y_pred)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color='red', lw=2)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual Values vs Predicted Values')
plt.show()

In [None]:
import joblib
from sklearn.metrics import r2_score

# Save model
joblib.dump(best_model, 'xgboost_model.pkl')

# Let's load the saved model again
loaded_model = joblib.load('xgboost_model.pkl')

# Let's predict on the test data
y_pred_loaded = loaded_model.predict(X_test)
print(f"R-squared score of the loaded model: {r2_score(y_test, y_pred_loaded):.2f}")
