# GeoLifeClef - TechieDuo Submission

Predicting which plant species are present at a given location is useful in many biodiversity management and conservation scenarios. Our approach first merges all the training datasets on their survey ID, then preprocesses and analyzes the merged data to understand it better. The same steps are applied to the test dataset, which keeps the training and test data in the same format so that both can be passed to the model for prediction.

Team Members:
1. Vijay Srinivas K
2. Vidisha Desai


In [None]:
# all import statements
import numpy as np
import pandas as pd
import os 

## Data Preprocessing:

*The dataset directory is scanned to find all the files in it. The required fields are then identified in these files and combined, matched on the survey ID, into a single large DataFrame. The required data columns and labels are taken from this combined dataset.*



In [None]:
# getting all the files to process
csv_files = []    
for root, dirs, files in os.walk("/kaggle/input/geolifeclef-2024"):
    for file in files:
        if file.endswith(".csv"):
            csv_files.append(os.path.join(root, file))
#print(csv_files)
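
The directory walk above can be tried on a throwaway directory tree before pointing it at the Kaggle input folder. The sketch below is only an illustration: `find_csv_files` is a hypothetical helper name, and the temporary files are made up.

```python
import os
import tempfile


def find_csv_files(root_dir):
    """Recursively collect the paths of all .csv files under root_dir."""
    found = []
    for root, dirs, files in os.walk(root_dir):
        for name in files:
            if name.endswith(".csv"):
                found.append(os.path.join(root, name))
    # os.walk gives no ordering guarantee, so sort for a stable result
    return sorted(found)


# Build a small, made-up directory tree and search it
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "nested"))
    for rel in ["a.csv", "notes.txt", os.path.join("nested", "b.csv")]:
        with open(os.path.join(tmp, rel), "w") as fh:
            fh.write("surveyId\n")
    csv_names = [os.path.relpath(p, tmp) for p in find_csv_files(tmp)]

print(csv_names)
```

Only the two `.csv` files are returned; `notes.txt` is skipped, and the file in the nested folder is found because `os.walk` descends into subdirectories.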

> Viewing the main dataset features

In [None]:
# Useful functions
def display_file_info(file_list):
    """Print the file name, shape and column names of each CSV file."""
    for fpath in file_list:
        df = pd.read_csv(fpath)
        print(fpath.split('/')[-1], ":", df.shape)
        print(df.columns)


def get_files(csv_files, con_str):
    """Return the paths in csv_files that do NOT contain con_str."""
    op_files = []
    for fpath in csv_files:
        if con_str not in fpath:
            op_files.append(fpath)
    return op_files

In [None]:
# getting only the train data
csv_files_train = get_files(csv_files, "test")
non_time_series_train = get_files(csv_files_train, "time_series")

#display_file_info(csv_files)
#print("\nNon Times series data:")
display_file_info(non_time_series_train)
print("\nData sets to train: ", len(non_time_series_train))
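
Note that `get_files` is an exclusion filter: it keeps the paths that do not contain the given substring, so passing `"test"` leaves the training files. A small sketch on hypothetical file names (these paths are made up for illustration):

```python
def get_files(csv_files, con_str):
    """Return the paths that do NOT contain con_str."""
    return [fpath for fpath in csv_files if con_str not in fpath]


# Hypothetical paths, for illustration only
paths = [
    "data/pa_train_elevation.csv",
    "data/pa_test_elevation.csv",
    "data/pa_train_time_series_landsat.csv",
    "data/sample_submission.csv",
]

train_only = get_files(paths, "test")                      # drop test files
non_time_series = get_files(train_only, "time_series")     # drop time series files
print(non_time_series)
```

A file whose path contains neither `"train"` nor `"test"`, such as a sample submission, passes both filters, which is why the notebook removes it by hand in the next cell.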

> Now, we create a combined df of all the training datasets

In [None]:
# train data
# copy the list so that remove() below does not also modify non_time_series_train
non_time_series_train_1 = list(non_time_series_train)
print(non_time_series_train_1)
non_time_series_train_1.remove("/kaggle/input/geolifeclef-2024/GLC24_P0_metadata_train.csv")
non_time_series_train_1.remove("/kaggle/input/geolifeclef-2024/GLC24_SAMPLE_SUBMISSION.csv")
non_time_series_train_1.remove("/kaggle/input/geolifeclef-2024/EnvironmentalRasters/EnvironmentalRasters/Climate/Monthly/GLC24-PA-train-bioclimatic_monthly.csv")
    
nts_train_dfs = []
for fpath in non_time_series_train_1:
    df = pd.read_csv(fpath)
    print(fpath.split('/')[-1], ":", df.shape)
    nts_train_dfs.append(df)

In [None]:
# merging and creating a combined df
merge_key = "surveyId"
merged_train_df = nts_train_dfs[0]
for i in range(1, len(nts_train_dfs)):
    merged_train_df = merged_train_df.merge(nts_train_dfs[i], on = merge_key)
print(merged_train_df.head(5))
merged_train_df.shape
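
The chained merge on `surveyId` can be illustrated on toy tables. In this sketch the tables and every column other than `surveyId` and `speciesId` are hypothetical; it uses `functools.reduce` as a compact equivalent of the loop above.

```python
from functools import reduce

import pandas as pd

# Hypothetical toy tables keyed by surveyId
elevation = pd.DataFrame({"surveyId": [1, 2, 3], "elevation": [120.0, 450.0, 30.0]})
soil = pd.DataFrame({"surveyId": [1, 2, 4], "soil_ph": [6.5, 7.1, 5.9]})
labels = pd.DataFrame({"surveyId": [1, 1, 2], "speciesId": [10, 11, 10]})

# Merge the tables one after another on the shared key (default: inner join)
merged = reduce(
    lambda left, right: left.merge(right, on="surveyId"),
    [elevation, soil, labels],
)
print(merged)
```

With the default inner join, surveys 3 and 4 disappear because they are missing from at least one table, and the features of survey 1 are repeated once per species recorded for it.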

In [None]:
val = list(merged_train_df.columns)
for i in val:
    print(i)

## Data Cleaning

We clean the data by dropping rows that contain null values. A further step that removes mismatched duplicate rows produced by the merge is kept below, but it is commented out.

In [None]:
# we have the merged_train_df now 
merged_train_df = merged_train_df.dropna(axis = 0)
print(merged_train_df.shape)
merged_train_df.head()
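
Before dropping rows, it can help to see how many values are missing per column, since `dropna(axis=0)` removes a row if any single column is null. A minimal sketch on a made-up frame (the feature columns are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing values
df = pd.DataFrame({
    "surveyId": [1, 2, 3, 4],
    "elevation": [120.0, np.nan, 30.0, 75.0],
    "soil_ph": [6.5, 7.1, np.nan, 5.9],
})

# Count missing values in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Drop every row that has at least one missing value
cleaned = df.dropna(axis=0)
print(cleaned.shape)
```

Here each feature column has only one missing value, yet half of the rows are removed, because the gaps fall in different rows.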

In [None]:
'''# getting rid of unwanted multiple instances of rows
condition = (merged_train_df["speciesId_x"] == merged_train_df["speciesId_y"]) # removing all the data mismatches
merged_train_df_clean = merged_train_df[condition]
merged_train_df_clean = merged_train_df_clean.drop(axis = 1, columns = ["speciesId_y"])
merged_train_df_clean = merged_train_df_clean.rename(columns={'speciesId_x': 'speciesId'})
print(merged_train_df_clean.shape)
merged_train_df_clean.head()'''
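
The commented-out cell above filters the merged frame on `speciesId_x == speciesId_y`. Those suffixed columns appear when two of the merged tables both carry a `speciesId` column. A minimal sketch of that situation on hypothetical toy tables (`lat` and `year` are made-up columns):

```python
import pandas as pd

# Two hypothetical tables that both contain surveyId and speciesId
a = pd.DataFrame({"surveyId": [1, 1], "speciesId": [10, 11], "lat": [45.0, 45.0]})
b = pd.DataFrame({"surveyId": [1, 1], "speciesId": [10, 11], "year": [2020, 2020]})

# Merging on surveyId alone pairs every row of a with every row of b
merged = a.merge(b, on="surveyId")
print(merged.columns.tolist())

# Keep only the rows where both species columns agree, then restore one column
matching = merged[merged["speciesId_x"] == merged["speciesId_y"]]
clean = matching.drop(columns=["speciesId_y"]).rename(columns={"speciesId_x": "speciesId"})
print(clean)
```

The merge produces four rows from two, and the filter brings it back to the two consistent ones.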

> Note: We now have a cleaned DataFrame. Because the deduplication cell above is commented out, `merged_train_df_clean` is never created, and the cells below work with `merged_train_df`, from which rows with missing values have been dropped.
Now we use this data to train a model and generate predictions for the test set.

## Test Data Creation:
In this section, we follow the same process as above to create the test data, which is later passed to the trained model for prediction.

In [None]:
def create_test_data(csv_files):
    # getting the test only
    csv_files_test = get_files(csv_files, "train")
    non_time_series_test = get_files(csv_files_test, "time_series")
    
    # copy the list so that remove() below does not also modify non_time_series_test
    non_time_series_test_1 = list(non_time_series_test)
    print(non_time_series_test_1)
    
    non_time_series_test_1.remove('/kaggle/input/geolifeclef-2024/GLC24_SAMPLE_SUBMISSION.csv')
    non_time_series_test_1.remove('/kaggle/input/geolifeclef-2024/EnvironmentalRasters/EnvironmentalRasters/Climate/Monthly/GLC24-PA-test-bioclimatic_monthly.csv')
    
    nts_test_dfs = []
    for fpath in non_time_series_test_1:
        df = pd.read_csv(fpath)
        print(fpath.split('/')[-1], ":", df.shape)
        nts_test_dfs.append(df)
        
    merge_key = "surveyId"
    merged_test_df = nts_test_dfs[0]
    for i in range(1, len(nts_test_dfs)):
        merged_test_df = merged_test_df.merge(nts_test_dfs[i], on = merge_key)
    print(merged_test_df.head(5))

    # fill missing values with 0 instead of dropping rows, so every merged test row is kept
    merged_test_df.fillna(0, inplace=True)
    print(merged_test_df.shape)
    
    return merged_test_df
    
merged_test_df = create_test_data(csv_files)

> The categorical columns `country` and `region` are converted to numerical codes, in both the training and the test data, to avoid issues later in model training.

In [None]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

# for training data
merged_train_df['country_encoded'] = label_encoder.fit_transform(merged_train_df['country'])
merged_train_df=merged_train_df.drop(axis=1,columns = ["country"])

merged_train_df['region_encoded'] = label_encoder.fit_transform(merged_train_df['region'])
merged_train_df=merged_train_df.drop(axis=1,columns = ["region"])

print(merged_train_df)

In [None]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

# for test data
merged_test_df['country_encoded'] = label_encoder.fit_transform(merged_test_df['country'])
merged_test_df=merged_test_df.drop(axis=1,columns = ["country"])

merged_test_df['region_encoded'] = label_encoder.fit_transform(merged_test_df['region'])
merged_test_df=merged_test_df.drop(axis=1,columns = ["region"])

print(merged_test_df)
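
One thing to watch for: `fit_transform` is called separately on the training and test frames, so the same country can receive a different integer code in each. The sketch below shows the effect on made-up values and one possible way around it, fitting a single encoder on the values from both frames. This is an illustrative alternative, not the approach used in the cells above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data
train = pd.DataFrame({"country": ["France", "Spain", "Italy"]})
test = pd.DataFrame({"country": ["Spain", "Italy"]})

# Fitting separately: codes depend on which values each frame happens to contain
sep_train = LabelEncoder().fit_transform(train["country"])
sep_test = LabelEncoder().fit_transform(test["country"])
print(sep_train, sep_test)  # Spain is coded differently in the two frames

# Fitting once on the values of both frames gives a shared mapping
encoder = LabelEncoder().fit(pd.concat([train["country"], test["country"]]))
train_codes = encoder.transform(train["country"])
test_codes = encoder.transform(test["country"])
print(train_codes, test_codes)
```

`LabelEncoder` assigns codes in sorted order of the values it was fitted on, so a separate encoder per column (fitted once and reused) keeps the train and test codes comparable.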

## Model:

> KNN Classifier Model:
We have identified that this is a classification problem and have decided to use a KNN classifier.

In [None]:
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Separate the target (speciesId) from the features
y_train = merged_train_df['speciesId']
X_train = merged_train_df.drop(columns='speciesId')

# Drop training rows that contain NaN or infinite values, keeping the labels aligned
X_train = X_train.replace([np.inf, -np.inf], np.nan).dropna()
y_train = y_train.loc[X_train.index]

# In the test data, replace infinite and missing values with 0
X_test = merged_test_df.replace([np.inf, -np.inf], np.nan).fillna(0)

# Ensure X_test has the same columns as X_train, filling any missing columns with 0
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
print(X_test.shape)

knn = KNeighborsClassifier(n_neighbors=3)

# Train the classifier
knn.fit(X_train, y_train)
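
Predictions on the test set are written to a submission file rather than scored locally, so a local accuracy figure has to come from held-out training data. A minimal sketch of such a check, using synthetic data rather than the competition data and assuming a simple stratified random split:

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data: two well-separated clusters standing in for two classes
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(50, 2)),
    rng.normal(5.0, 0.5, size=(50, 2)),
])
y = np.array([0] * 50 + [1] * 50)

# Hold out 20% of the labelled rows for validation
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

knn_check = KNeighborsClassifier(n_neighbors=3).fit(X_fit, y_fit)
val_accuracy = accuracy_score(y_val, knn_check.predict(X_val))
print(val_accuracy)
```

The same pattern could be applied to `X_train` and `y_train` before fitting on the full training set, although the score on real data will depend on the features and on how many species there are.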

In [None]:
X_test.fillna(0, inplace=True)
y_pred = knn.predict(X_test)

predictions = pd.DataFrame(y_pred, columns=['predictions'])
predictions['surveyId'] = X_test['surveyId'].values

column_order = ['surveyId','predictions']

predictions = predictions[column_order]

predictions.to_csv('submission.csv', index=False)

print(predictions)