<a href="https://colab.research.google.com/github/seungwoosoon/SmartFarmProject/blob/AI/condition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task
Load the CSV file "MyDrive/asset/condition/condition1.csv", fill missing values using linear interpolation, and remove the columns '시작', '종료', '누적량1', '일출시간', '일몰시간', 'EC급액', 'EC배액', 'pH급액', '저울급액', '저울배액', '온도 급액', '내부 이산화탄소', '내부광량', '내부PT100온도센서1', '내부 PT100온도센서2', '내부 PT100온도센서3', '내부PT100센서를 이용', and '설정pH'. Then explain how to predict weekly tomato fruit set from weekly environmental data and weekly growth indicators, reconciling the time difference between the time-series data and the growth indicators by aggregating the data weekly.
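
One way to handle the time difference between the sensor series and the growth indicators is to put both on the same weekly grid and then shift the environmental features by a lag. The following is a minimal sketch on synthetic data; the column names `temp` and `fruit_set` and the one-week lag are illustrative assumptions, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic daily environment readings (hypothetical column name 'temp').
days = pd.date_range("2024-01-01", periods=28, freq="D")
env_daily = pd.DataFrame({"temp": np.arange(28, dtype=float)}, index=days)

# Synthetic weekly fruit-set counts, one value per week ending on Sunday.
weeks = pd.date_range("2024-01-07", periods=4, freq="W")
fruit_weekly = pd.DataFrame({"fruit_set": [3, 5, 4, 6]}, index=weeks)

# Aggregate the environment to the same weekly grid (weeks end on Sunday).
env_weekly = env_daily.resample("W").mean()

# Shift features by one week so that the fruit set of week t is paired with
# the environment of week t-1 (the lag length is an assumption to be tuned).
env_lagged = env_weekly.shift(1).add_suffix("_lag1")

# The first week has no lagged features, so it is dropped.
dataset = env_lagged.join(fruit_weekly, how="inner").dropna()
print(dataset)
```

The appropriate lag depends on the crop physiology and should be chosen by validation rather than fixed in advance.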

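## Data Preparation

### Subtask:
Load the environmental data and the growth indicator data, and convert the time information into week information. (If needed, use the environmental data processed in the previous step.)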
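
Converting timestamps to week information can be sketched as follows. The timestamps are made up for illustration; the sketch shows two common options, ISO week numbers and a week-ending date label that matches what `resample('W')` produces.

```python
import pandas as pd

# Hypothetical timestamps; the real data is assumed to have a 'date' column.
df = pd.DataFrame({"date": pd.to_datetime(["2024-01-01", "2024-01-07", "2024-01-08"])})

# Option 1: ISO calendar year and week number (ISO weeks start on Monday).
iso = df["date"].dt.isocalendar()
df["iso_year"] = iso["year"]
df["iso_week"] = iso["week"]

# Option 2: label each row with the Sunday that ends its week.
df["week_end"] = df["date"].dt.to_period("W").dt.end_time.dt.normalize()
print(df)
```

A week-ending date is usually easier to merge on than a bare week number, because week numbers repeat every year.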


**Reasoning**:
The previous step did not load the data, so the first step is to load the data from the specified CSV file into a pandas DataFrame.



**Reasoning**:
The previous attempt to load the data failed due to a `FileNotFoundError`. This indicates the file path was incorrect. I will try a different path to load the data.



**Reasoning**:
Load the CSV file again, this time trying to find the correct path. If the file is still not found, it indicates a persistent issue with the file path or existence, and I will report failure.



**Reasoning**:
Attempt to load the data from the specified path and the current directory, handle missing values with linear interpolation, remove the specified columns, and display the head of the dataframe.
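
The interpolation and column-removal steps can be illustrated on a small synthetic frame. The column names `co2` and `unused` are placeholders, not columns from the CSV file.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"],
    "co2": [400.0, np.nan, np.nan, 430.0],
    "unused": [1, 2, 3, 4],
})

# Interpolate only numeric columns; text columns such as dates are left alone.
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].interpolate(method="linear")

# errors='ignore' skips names that are not present instead of raising.
df = df.drop(columns=["unused", "not_a_column"], errors="ignore")
print(df)
```

Because `errors='ignore'` silently skips unknown names, a misspelled column name in the drop list goes unnoticed; comparing `df.columns` before and after is a cheap check.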



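## Weekly Data Aggregation

### Subtask:
Aggregate the environmental data and the growth indicator data by week. Use the mean for environmental data, and the last value of each week (or another suitable aggregation) for growth indicators. For tomato fruit set, compute the weekly total.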
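
The three aggregation rules described above can be applied in a single `resample(...).agg(...)` call. This is a sketch on synthetic daily data; `temp`, `height`, and `fruit_set` are hypothetical column names.

```python
import pandas as pd

idx = pd.date_range("2024-01-01", periods=14, freq="D")
daily = pd.DataFrame({
    "temp": [20.0] * 7 + [22.0] * 7,    # environment: weekly mean
    "height": list(range(100, 114)),    # growth indicator: last value of the week
    "fruit_set": [1] * 14,              # fruit set: weekly total
}, index=idx)

# One aggregation function per column; weeks end on Sunday by default.
weekly = daily.resample("W").agg({"temp": "mean", "height": "last", "fruit_set": "sum"})
print(weekly)
```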



**Reasoning**:
Based on the previous attempts, the dataframes `df_growth` and `df_fruit_set` are not available. I will proceed with aggregating the available `df_condition` by '주차' using the mean, as requested in the instructions. I will then display the head of the resulting dataframe.



In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import pandas as pd

file_path = '/content/drive/MyDrive/asset/condition/condition1.csv'

try:
    df_condition = pd.read_csv(file_path)

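    # Fill missing values using linear interpolation (numeric columns only,
    # since text columns such as 'date' cannot be interpolated)
    num_cols = df_condition.select_dtypes(include='number').columns
    df_condition[num_cols] = df_condition[num_cols].interpolate(method='linear')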

    # Remove specified columns
    columns_to_drop = ['시작', '종료', '누적량1', '일출시간', '일몰시간', 'EC급액', 'EC배액', 'pH급액', '저울급액', '저울배액', '온도 급액', '내부 이산화탄소', '내부광량', '내부PT100온도센서1', '내부 PT100온도센서2', '내부 PT100온도센서3', '내부PT100센서를 이용', '설정pH']
    df_condition.drop(columns=columns_to_drop, errors='ignore', inplace=True)

    print("Data loaded, missing values filled, and specified columns dropped.")
    display(df_condition.head())

except FileNotFoundError:
    print(f"Error: The file was not found at {file_path}. Please check the path and try again.")
except Exception as e:
    print(f"An error occurred: {e}")

In [None]:
# Convert 'date' column to datetime objects
df_condition['date'] = pd.to_datetime(df_condition['date'])

# Set 'date' as the index
df_condition.set_index('date', inplace=True)

# Resample the data weekly and calculate the mean for environmental data
# (numeric_only=True skips any remaining non-numeric columns)
df_weekly_condition = df_condition.resample('W').mean(numeric_only=True)

# The weekly sum of '토마토 착과수' (tomato fruit set) is not computed here.
# That column is assumed not to be in df_condition, so it must be loaded from
# a separate DataFrame and merged first.
# This cell aggregates only the weekly environmental data.

print("Weekly environmental data aggregated.")
display(df_weekly_condition.head())

In [None]:
# Select relevant columns based on user's request
selected_columns = ['현재EC(dS)', '현재PH(pH)', '현재일사(W)', '누적일사(J)', '내부이산화탄소', '내부PT100온도센서1번건구', '내부PT100온도센서2번습구', '내부PT100온도센서3번', '내부PT100센서를이용한계산습도']
df_selected_env = df_weekly_condition[selected_columns].copy()

# Calculate the average of the three temperature sensor columns
temperature_columns = ['내부PT100온도센서1번건구', '내부PT100온도센서2번습구', '내부PT100온도센서3번']
df_selected_env['평균온도'] = df_selected_env[temperature_columns].mean(axis=1)

# Format the average temperature to one decimal place
df_selected_env['평균온도'] = df_selected_env['평균온도'].round(1)

# Drop the original individual temperature columns
df_selected_env.drop(columns=temperature_columns, inplace=True)

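# Interpolate missing values in the '내부이산화탄소' column
# (assign the result back; in-place interpolation on a selected column may not update the DataFrame)
df_selected_env['내부이산화탄소'] = df_selected_env['내부이산화탄소'].interpolate(method='linear')


print("Selected environmental data with average temperature and interpolated CO2 values:")
display(df_selected_env.head())

print("\nNumber of missing values in the '내부이산화탄소' column (after interpolation):")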
display(df_selected_env['내부이산화탄소'].isnull().sum())

In [None]:
print("Any missing values? (True: missing values present):")
display(df_selected_env['내부이산화탄소'].isnull().any())

print("\nNumber of missing values in the '내부이산화탄소' column:")
display(df_selected_env['내부이산화탄소'].isnull().sum())

In [None]:
# Drop rows with any missing values
df_cleaned_env = df_selected_env.dropna()

print("Data after dropping rows with missing values:")
display(df_cleaned_env.head())

print("\nNumber of missing values (all columns):")
display(df_cleaned_env.isnull().sum())

# Task
Analyze the time-series environmental data from "MyDrive/asset/condition/condition1.csv", which has been preprocessed to handle missing values and to keep only the relevant columns (temperature, solar radiation, humidity, CO2 concentration, pH, current solar radiation, cumulative solar radiation, and the mean of the internal PT100 temperature sensors). Combine it with weekly growth data (assumed to be available or to be provided) to predict weekly tomato growth (CM) using an MLP model. Aggregate the environmental data into 3-day intervals for statistical analysis before combining it with the weekly growth data. Provide the steps to prepare the data, build and train the MLP model, and evaluate its performance.
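
The 3-day aggregation mentioned in the task can be sketched with `resample('3D')`. The data and the column name `temp` are synthetic, and the choice of mean, minimum, and maximum as summary statistics is an assumption.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2024-01-01", periods=9, freq="D")
env = pd.DataFrame({"temp": np.arange(9, dtype=float)}, index=idx)

# Summary statistics per 3-day window; bins start at midnight of the first day.
stats_3d = env["temp"].resample("3D").agg(["mean", "min", "max"])
print(stats_3d)
```

Note that 3-day windows do not divide evenly into 7-day weeks, so some windows straddle a week boundary when these statistics are later mapped onto the weekly growth data.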

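## Weekly Data Preparation

### Subtask:
Prepare the weekly environmental data (`df_cleaned_env`) and the weekly growth indicator data. (If the weekly growth indicator data is not available, load it or aggregate it to weekly values.)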


**Reasoning**:
Check if the DataFrame `df_growth_weekly` exists. If it does, display its head; otherwise, explain that the data is missing and describe the expected format.



In [None]:
try:
    # Check if df_growth_weekly exists
    if 'df_growth_weekly' in locals():
        print("Weekly growth indicator data found.")
        display(df_growth_weekly.head())
    else:
        print("Weekly growth indicator data (df_growth_weekly) not found.")
        print("This data is needed for the next steps.")
        print("Expected format: A DataFrame with a time identifier column (e.g., '주차' or a datetime index) and a target variable column ('주간생장량(CM)').")

except Exception as e:
    print(f"An error occurred while checking for weekly growth data: {e}")


## Data Merging

### Subtask:
Merge the weekly environmental data and the weekly growth indicator data on week or date to create the final training dataset.
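
An inner join on a date index keeps only rows whose dates match exactly, so measurements taken on a different weekday are silently dropped. The sketch below uses made-up dates and values to show the effect, and one possible remedy: relabeling the growth dates with the Sunday that ends their week.

```python
import pandas as pd

# Weekly environment rows labelled by week-ending Sunday (as resample('W') does).
env = pd.DataFrame({"temp": [20.0, 21.0, 22.0]},
                   index=pd.to_datetime(["2024-01-14", "2024-01-21", "2024-01-28"]))

# Hypothetical growth measurements; the second one was taken on a Wednesday.
growth = pd.DataFrame({"weeklyGrowth": [5.0, 6.0]},
                      index=pd.to_datetime(["2024-01-14", "2024-01-24"]))

# Exact-date inner join: the Wednesday measurement finds no partner.
merged = pd.merge(env, growth, left_index=True, right_index=True, how="inner")

# Relabel growth dates with their week-ending Sunday, then join again.
growth_aligned = growth.copy()
growth_aligned.index = growth_aligned.index.to_period("W").end_time.normalize()
merged_aligned = pd.merge(env, growth_aligned, left_index=True, right_index=True, how="inner")

print(len(merged), len(merged_aligned))
```

Comparing the row count before and after the merge is a quick way to detect this kind of misalignment.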


**Reasoning**:
The previous step indicated that the weekly growth data was missing. To proceed with merging the environmental and growth data, I need to load a placeholder weekly growth dataset with a compatible time identifier and a target variable column.



In [None]:
import pandas as pd
import numpy as np

# Create a placeholder DataFrame for weekly growth data
# Assuming the weekly growth data has a 'date' column and a '주간생장량(CM)' column
# The dates should align with or overlap with the dates in df_cleaned_env

# Generate sample dates (weekly) that align with df_cleaned_env index
start_date = df_cleaned_env.index.min()
end_date = df_cleaned_env.index.max()
weekly_dates = pd.date_range(start=start_date, end=end_date, freq='W')

# Generate random placeholder growth data
np.random.seed(42) # for reproducibility
weekly_growth_cm = np.random.rand(len(weekly_dates)) * 10 # Random growth between 0 and 10 cm

df_growth_weekly = pd.DataFrame({'date': weekly_dates, '주간생장량(CM)': weekly_growth_cm})

# Set 'date' as the index to match df_cleaned_env
df_growth_weekly.set_index('date', inplace=True)

print("Placeholder weekly growth data created:")
display(df_growth_weekly.head())

**Reasoning**:
The previous command failed because `df_cleaned_env` was not defined. I need to redefine `df_cleaned_env` by repeating the data loading, cleaning, and aggregation steps from the previous successful execution, and then proceed with creating the placeholder weekly growth data.



In [None]:
import pandas as pd
import numpy as np

# Repeat data loading and cleaning steps to define df_cleaned_env
file_path = '/content/drive/MyDrive/asset/condition/condition1.csv'

try:
    df_condition = pd.read_csv(file_path)

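    # Fill missing values using linear interpolation (numeric columns only,
    # since text columns such as 'date' cannot be interpolated)
    num_cols = df_condition.select_dtypes(include='number').columns
    df_condition[num_cols] = df_condition[num_cols].interpolate(method='linear')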

    # Remove specified columns
    columns_to_drop = ['시작', '종료', '누적량1', '일출시간', '일몰시간', 'EC급액', 'EC배액', 'pH급액', '저울급액', '저울배액', '온도 급액', '내부 이산화탄소', '내부광량', '내부PT100온도센서1', '내부 PT100온도센서2', '내부 PT100온도센서3', '내부PT100센서를 이용', '설정pH']
    df_condition.drop(columns=columns_to_drop, errors='ignore', inplace=True)

    # Convert 'date' column to datetime objects
    df_condition['date'] = pd.to_datetime(df_condition['date'])

    # Set 'date' as the index
    df_condition.set_index('date', inplace=True)

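    # Resample the data weekly and calculate the mean for environmental data
    # (numeric_only=True skips any remaining non-numeric columns)
    df_weekly_condition = df_condition.resample('W').mean(numeric_only=True)

    # Select relevant columns (the humidity column name is written without spaces,
    # matching the other cells that select it)
    selected_columns = ['현재EC(dS)', '현재PH(pH)', '현재일사(W)', '누적일사(J)', '내부이산화탄소', '내부PT100온도센서1번건구', '내부PT100온도센서2번습구', '내부PT100온도센서3번', '내부PT100센서를이용한계산습도']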
    df_selected_env = df_weekly_condition[selected_columns].copy()

    # Calculate the average of the three temperature sensor columns
    temperature_columns = ['내부PT100온도센서1번건구', '내부PT100온도센서2번습구', '내부PT100온도센서3번']
    df_selected_env['평균온도'] = df_selected_env[temperature_columns].mean(axis=1)

    # Format the average temperature to one decimal place
    df_selected_env['평균온도'] = df_selected_env['평균온도'].round(1)

    # Drop the original individual temperature columns
    df_selected_env.drop(columns=temperature_columns, inplace=True)

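    # Interpolate missing values in the '내부이산화탄소' column
    # (assign the result back; in-place interpolation on a selected column may not update the DataFrame)
    df_selected_env['내부이산화탄소'] = df_selected_env['내부이산화탄소'].interpolate(method='linear')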

    # Drop rows with any missing values
    df_cleaned_env = df_selected_env.dropna()

    print("Environmental data loaded, cleaned, aggregated weekly, and missing values handled.")
    display(df_cleaned_env.head())

    # Now, create the placeholder DataFrame for weekly growth data
    # Generate sample dates (weekly) that align with df_cleaned_env index
    start_date = df_cleaned_env.index.min()
    end_date = df_cleaned_env.index.max()
    weekly_dates = pd.date_range(start=start_date, end=end_date, freq='W')

    # Generate random placeholder growth data
    np.random.seed(42) # for reproducibility
    weekly_growth_cm = np.random.rand(len(weekly_dates)) * 10 # Random growth between 0 and 10 cm

    df_growth_weekly = pd.DataFrame({'date': weekly_dates, '주간생장량(CM)': weekly_growth_cm})

    # Set 'date' as the index to match df_cleaned_env
    df_growth_weekly.set_index('date', inplace=True)

    print("\nPlaceholder weekly growth data created:")
    display(df_growth_weekly.head())

except FileNotFoundError:
    print(f"Error: The file was not found at {file_path}. Please check the path and try again.")
except Exception as e:
    print(f"An error occurred: {e}")

In [None]:
import pandas as pd

file_path_growth = '/content/drive/MyDrive/updated_sorted_data/tom1_updated_sorted_data.xlsx'

try:
    df_growth = pd.read_excel(file_path_growth)

    print("Weekly growth data loaded:")
    display(df_growth.head())

except FileNotFoundError:
    print(f"Error: weekly growth data file not found: {file_path_growth}. Please check the path.")
except Exception as e:
    print(f"An error occurred while loading the weekly growth data: {e}")

In [None]:
# Assign correct column names based on the last row
df_growth.columns = df_growth.iloc[-1]

# Drop the last row, which contains the column names; copy so that the
# column assignments below modify this DataFrame rather than a slice
df_growth = df_growth.iloc[:-1].copy()

# Convert the 'date' column to datetime objects
df_growth['date'] = pd.to_datetime(df_growth['date'], format='%Y%m%d')

# Set 'date' as the index
df_growth.set_index('date', inplace=True)

# Convert relevant columns to numeric, coercing errors
df_growth['week'] = pd.to_numeric(df_growth['week'], errors='coerce')
df_growth['weeklyGrowth'] = pd.to_numeric(df_growth['weeklyGrowth'], errors='coerce')

# Drop rows where key columns like 'weeklyGrowth' became NaN after coercion
df_growth.dropna(subset=['weeklyGrowth'], inplace=True)


print("Weekly growth data preprocessing complete:")
display(df_growth.head())

print("\nData info after preprocessing:")
df_growth.info()

In [None]:
# Merge the weekly environmental data and weekly growth dataframes on their index (date)
df_merged = pd.merge(df_cleaned_env, df_growth[['weeklyGrowth']], left_index=True, right_index=True, how='inner')

print("Weekly environmental data and growth data merged:")
display(df_merged.head())

print("\nMerged data info:")
df_merged.info()

In [None]:
import pandas as pd
import numpy as np

# Repeat data loading and cleaning steps to define df_cleaned_env
file_path = '/content/drive/MyDrive/asset/condition/condition1.csv'

try:
    df_condition = pd.read_csv(file_path)

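    # Fill missing values using linear interpolation (numeric columns only,
    # since text columns such as 'date' cannot be interpolated)
    num_cols = df_condition.select_dtypes(include='number').columns
    df_condition[num_cols] = df_condition[num_cols].interpolate(method='linear')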

    # Remove specified columns
    columns_to_drop = ['시작', '종료', '누적량1', '일출시간', '일몰시간', 'EC급액', 'EC배액', 'pH급액', '저울급액', '저울배액', '온도 급액', '내부 이산화탄소', '내부광량', '내부PT100온도센서1', '내부 PT100온도센서2', '내부 PT100온도센서3', '내부PT100센서를 이용', '설정pH']
    df_condition.drop(columns=columns_to_drop, errors='ignore', inplace=True)

    # Convert 'date' column to datetime objects
    df_condition['date'] = pd.to_datetime(df_condition['date'])

    # Set 'date' as the index
    df_condition.set_index('date', inplace=True)

    # Resample the data weekly and calculate the mean for environmental data
    # (numeric_only=True skips any remaining non-numeric columns)
    df_weekly_condition = df_condition.resample('W').mean(numeric_only=True)

    # Select relevant columns
    selected_columns = ['현재EC(dS)', '현재PH(pH)', '현재일사(W)', '누적일사(J)', '내부이산화탄소', '내부PT100온도센서1번건구', '내부PT100온도센서2번습구', '내부PT100온도센서3번', '내부PT100센서를이용한계산습도']
    df_selected_env = df_weekly_condition[selected_columns].copy()

    # Rename the humidity column
    df_selected_env.rename(columns={'내부PT100센서를이용한계산습도': '습도'}, inplace=True)

    # Calculate the average of the three temperature sensor columns
    temperature_columns = ['내부PT100온도센서1번건구', '내부PT100온도센서2번습구', '내부PT100온도센서3번']
    df_selected_env['평균온도'] = df_selected_env[temperature_columns].mean(axis=1)

    # Format the average temperature to one decimal place
    df_selected_env['평균온도'] = df_selected_env['평균온도'].round(1)

    # Drop the original individual temperature columns
    df_selected_env.drop(columns=temperature_columns, inplace=True)

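    # Interpolate missing values in the '내부이산화탄소' column
    # (assign the result back; in-place interpolation on a selected column may not update the DataFrame)
    df_selected_env['내부이산화탄소'] = df_selected_env['내부이산화탄소'].interpolate(method='linear')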

    # Drop rows with any missing values
    df_cleaned_env = df_selected_env.dropna()

    print("Environmental data loaded, cleaned, aggregated weekly, and missing values handled.")
    display(df_cleaned_env.head())

    # Merge the weekly environmental data and weekly growth dataframes on their index (date)
    # Assuming df_growth is already loaded and preprocessed
    if 'df_growth' in locals():
        df_merged = pd.merge(df_cleaned_env, df_growth[['weeklyGrowth']], left_index=True, right_index=True, how='inner')

        print("\nWeekly environmental data and growth data merged:")
        display(df_merged.head())

        print("\nMerged data info:")
        df_merged.info()
    else:
        print("\nError: weekly growth data (df_growth) has not been loaded. Load the growth data first.")


except FileNotFoundError:
    print(f"Error: The file was not found at {file_path}. Please check the path and try again.")
except Exception as e:
    print(f"An error occurred: {e}")

## Data Preprocessing

### Subtask:
Preprocess the merged data for MLP training. Apply feature scaling and, if needed, additional preprocessing that takes the time-series nature of the data into account.
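
Min-max scaling maps each feature column to the range [0, 1] via `(x - min) / (max - min)`. The short check below, on a made-up array, confirms that this formula matches what scikit-learn's `MinMaxScaler` computes with its default settings.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Synthetic feature matrix: 3 samples, 2 features on different scales.
X_demo = np.array([[1.0, 10.0], [2.0, 30.0], [3.0, 50.0]])

# Column-wise min-max formula written out by hand.
manual = (X_demo - X_demo.min(axis=0)) / (X_demo.max(axis=0) - X_demo.min(axis=0))

# Same transformation through scikit-learn.
scaled = MinMaxScaler().fit_transform(X_demo)
print(np.allclose(manual, scaled))
```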

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Separate features (X) and target (y)
X = df_merged.drop('weeklyGrowth', axis=1)
y = df_merged['weeklyGrowth']

# Initialize the MinMaxScaler
scaler = MinMaxScaler()

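# Fit the scaler to the features and transform.
# Note: fitting on the full dataset lets test-set minima and maxima influence
# the scaling; for a strict hold-out evaluation, fit on the training rows only.
X_scaled = scaler.fit_transform(X)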

print("Features scaled:")
display(X_scaled[:5])

print("\nTarget variable (first 5 values):")
display(y.head())

## Data Splitting

### Subtask:
Split the dataset into a training set and a test set.
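
For time-ordered data, one common alternative to a random split is a chronological split, with the scaler fitted on the training rows only. The sketch below uses synthetic arrays (`X_all`, `y_all` are placeholder names) and is one possible setup, not necessarily the one this notebook uses.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic, time-ordered data: 10 weeks, 2 features with an upward trend.
X_all = np.arange(20, dtype=float).reshape(10, 2)
y_all = np.arange(10, dtype=float)

# shuffle=False keeps the final 20% of weeks as the test set.
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.2, shuffle=False)

# Scaling statistics come from the training rows only.
demo_scaler = MinMaxScaler().fit(X_tr)
X_tr_s = demo_scaler.transform(X_tr)
X_te_s = demo_scaler.transform(X_te)  # values may fall outside [0, 1]

print(y_te, X_te_s.max())
```

Scaled test values above 1 are expected here because the synthetic features keep rising after the training period.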

In [None]:
from sklearn.model_selection import train_test_split

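# Split the data into training and testing sets
# Using a standard split (e.g., 80% train, 20% test)
# Note: rows are shuffled by default, so test weeks are interleaved with
# training weeks; pass shuffle=False for a chronological hold-out instead.
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)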

print("Data split into training and testing sets:")
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

## Building and Training the MLP Model

### Subtask:
Define an MLP regression model and train it on the training set.
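
Scaling and the MLP can also be bundled in a scikit-learn pipeline, so that the scaler is always fitted on exactly the rows the model is trained on. This is a sketch on synthetic data with an artificial linear target; the hyperparameters mirror the ones used in this notebook, and everything else is an assumption.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic features and a simple linear target, for illustration only.
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 1, size=(200, 3))
y_syn = 2.0 * X_syn[:, 0] + X_syn[:, 1]

# Scaler and regressor are fitted together on the training rows.
pipe = make_pipeline(
    MinMaxScaler(),
    MLPRegressor(hidden_layer_sizes=(64, 32), activation="relu",
                 solver="adam", max_iter=500, random_state=42),
)
pipe.fit(X_syn[:160], y_syn[:160])

# Predict on the held-out rows.
preds = pipe.predict(X_syn[160:])
print(preds.shape)
```

With only a few dozen weekly rows, a network of this size can easily overfit, so a smaller hidden layer or a simpler baseline model is worth comparing against.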

In [None]:
from sklearn.neural_network import MLPRegressor

# Define the MLP Regressor model
# You can adjust the hidden_layer_sizes, activation, solver, etc.
mlp_model = MLPRegressor(hidden_layer_sizes=(64, 32), activation='relu', solver='adam', max_iter=500, random_state=42)

# Train the model using the training data
mlp_model.fit(X_train, y_train)

print("MLP model training complete.")

## Model Evaluation

### Subtask:
Evaluate the trained MLP model on the test set.
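
The regression metrics used for evaluation can be written out directly, which makes their meaning easier to check. The values below are made up; the sketch compares hand-computed MSE, RMSE, MAE, and R² with the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true values and predictions.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.0, 2.0, 2.0, 5.0])

err = y_true - y_hat
mse_manual = np.mean(err ** 2)       # mean of squared errors
rmse_manual = np.sqrt(mse_manual)    # same units as the target
mae_manual = np.mean(np.abs(err))    # mean of absolute errors
# R^2 = 1 - (residual sum of squares) / (total sum of squares)
r2_manual = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mse_manual, rmse_manual, mae_manual, r2_manual)
print(mean_squared_error(y_true, y_hat), mean_absolute_error(y_true, y_hat), r2_score(y_true, y_hat))
```

An R² at or below zero means the model does no better than predicting the mean of the test targets.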

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

# Make predictions on the test set
y_pred = mlp_model.predict(X_test)

# Evaluate the model using various metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("MLP Model Evaluation:")
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
print(f"Mean Absolute Error (MAE): {mae:.4f}")
print(f"R-squared (R2): {r2:.4f}")