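### Final Data Analysis Reporting Assignment
- Predict CO2 Emissions in Rwanda
- co2rwanda.csv (uploaded to Google Drive)
- This is Kaggle competition data.
- The columns can be hard to understand on their own, so the detailed description is shared below.
- Please make sure the preprocessing for the analysis is done carefully and documented clearly.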

Dataset Description
The objective of this challenge is to create machine learning models that use open-source emissions data (from Sentinel-5P satellite observations) to predict carbon emissions.

Approximately 497 unique locations were selected from multiple areas in Rwanda, with a distribution around farm lands, cities and power plants. The data for this competition is split by time; the years 2019 - 2021 are included in the training data, and your task is to predict the CO2 emissions data for 2022 through November.

Seven main features were extracted weekly from Sentinel-5P from January 2019 to November 2022. Each feature (Sulphur Dioxide, Carbon Monoxide, etc.) contains sub-features such as column_number_density, which is the vertical column density at ground level, calculated using the DOAS technique. You can read more about each feature in the links below, including how they are measured and variable definitions. You are given the values of these features in the test set, and your goal is to predict CO2 emissions using time information as well as these features.

Sulphur Dioxide - COPERNICUS/S5P/NRTI/L3_SO2
Carbon Monoxide - COPERNICUS/S5P/NRTI/L3_CO
Nitrogen Dioxide - COPERNICUS/S5P/NRTI/L3_NO2
Formaldehyde - COPERNICUS/S5P/NRTI/L3_HCHO
UV Aerosol Index - COPERNICUS/S5P/NRTI/L3_AER_AI
Ozone - COPERNICUS/S5P/NRTI/L3_O3
Cloud - COPERNICUS/S5P/OFFL/L3_CLOUD

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

%matplotlib inline
warnings.filterwarnings("ignore")

In [2]:
co2_df = pd.read_csv("co2rwanda.csv")

In [3]:
co2_df

Unnamed: 0,ID_LAT_LON_YEAR_WEEK,latitude,longitude,year,week_no,SulphurDioxide_SO2_column_number_density,SulphurDioxide_SO2_column_number_density_amf,SulphurDioxide_SO2_slant_column_number_density,SulphurDioxide_cloud_fraction,SulphurDioxide_sensor_azimuth_angle,...,Cloud_cloud_top_height,Cloud_cloud_base_pressure,Cloud_cloud_base_height,Cloud_cloud_optical_depth,Cloud_surface_albedo,Cloud_sensor_azimuth_angle,Cloud_sensor_zenith_angle,Cloud_solar_azimuth_angle,Cloud_solar_zenith_angle,emission
0,ID_-0.510_29.290_2019_00,-0.510,29.290,2019,0,-0.000108,0.603019,-0.000065,0.255668,-98.593887,...,3664.436218,61085.809570,2615.120483,15.568533,0.272292,-12.628986,35.632416,-138.786423,30.752140,3.750994
1,ID_-0.510_29.290_2019_01,-0.510,29.290,2019,1,0.000021,0.728214,0.000014,0.130988,16.592861,...,3651.190311,66969.478735,3174.572424,8.690601,0.256830,30.359375,39.557633,-145.183930,27.251779,4.025176
2,ID_-0.510_29.290_2019_02,-0.510,29.290,2019,2,0.000514,0.748199,0.000385,0.110018,72.795837,...,4216.986492,60068.894448,3516.282669,21.103410,0.251101,15.377883,30.401823,-142.519545,26.193296,4.231381
3,ID_-0.510_29.290_2019_03,-0.510,29.290,2019,3,,,,,,...,5228.507736,51064.547339,4180.973322,15.386899,0.262043,-11.293399,24.380357,-132.665828,28.829155,4.305286
4,ID_-0.510_29.290_2019_04,-0.510,29.290,2019,4,-0.000079,0.676296,-0.000048,0.121164,4.121269,...,3980.598120,63751.125781,3355.710107,8.114694,0.235847,38.532263,37.392979,-141.509805,22.204612,4.347317
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
79018,ID_-3.299_30.301_2021_48,-3.299,30.301,2021,48,0.000284,1.195643,0.000340,0.191313,72.820518,...,5459.185355,60657.101913,4590.879504,20.245954,0.304797,-35.140368,40.113533,-129.935508,32.095214,29.404171
79019,ID_-3.299_30.301_2021_49,-3.299,30.301,2021,49,0.000083,1.130868,0.000063,0.177222,-12.856753,...,5606.449457,60168.191528,4659.130378,6.104610,0.314015,4.667058,47.528435,-134.252871,30.771469,29.186497
79020,ID_-3.299_30.301_2021_50,-3.299,30.301,2021,50,,,,,,...,6222.646776,56596.027209,5222.646823,14.817885,0.288058,-0.340922,35.328098,-134.731723,30.716166,29.131205
79021,ID_-3.299_30.301_2021_51,-3.299,30.301,2021,51,-0.000034,0.879397,-0.000028,0.184209,-100.344827,...,7896.456885,46533.348194,6946.858022,32.594768,0.274047,8.427699,48.295652,-139.447849,29.112868,28.125792


In [4]:
co2_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79023 entries, 0 to 79022
Data columns (total 76 columns):
 #   Column                                                    Non-Null Count  Dtype  
---  ------                                                    --------------  -----  
 0   ID_LAT_LON_YEAR_WEEK                                      79023 non-null  object 
 1   latitude                                                  79023 non-null  float64
 2   longitude                                                 79023 non-null  float64
 3   year                                                      79023 non-null  int64  
 4   week_no                                                   79023 non-null  int64  
 5   SulphurDioxide_SO2_column_number_density                  64414 non-null  float64
 6   SulphurDioxide_SO2_column_number_density_amf              64414 non-null  float64
 7   SulphurDioxide_SO2_slant_column_number_density            64414 non-null  float64
 8   SulphurDioxide_c

- year and week_no are int64, ID_LAT_LON_YEAR_WEEK is object, and the remaining columns are float64.

In [9]:
co2_df.isna().sum()

ID_LAT_LON_YEAR_WEEK            0
latitude                        0
longitude                       0
year                            0
week_no                         0
                             ... 
Cloud_sensor_azimuth_angle    484
Cloud_sensor_zenith_angle     484
Cloud_solar_azimuth_angle     484
Cloud_solar_zenith_angle      484
emission                        0
Length: 76, dtype: int64
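
The truncated output above hides most of the 76 columns. One way to see where the gaps are is to average the missing ratio per sensor prefix (the part of the column name before the first underscore). The sketch below is only an illustration on a small synthetic frame; `demo` and its values are assumptions, not rows from co2rwanda.csv.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for co2_df with two sensor prefixes (values are made up).
demo = pd.DataFrame({
    "latitude": [-0.51, -0.51, -0.51, -0.51],
    "SulphurDioxide_cloud_fraction": [0.25, np.nan, 0.11, np.nan],
    "SulphurDioxide_sensor_azimuth_angle": [-98.6, np.nan, 72.8, np.nan],
    "Cloud_surface_albedo": [0.27, 0.26, np.nan, 0.26],
})

# Fraction of NaN values in each column.
missing_ratio = demo.isna().mean()

# Group columns by the text before the first underscore and average the ratios.
sensor = pd.Index([name.split("_")[0] for name in missing_ratio.index], name="sensor")
by_sensor = missing_ratio.groupby(sensor).mean().sort_values(ascending=False)
print(by_sensor)
```

Applied to co2_df, the same three lines would show which sensor groups account for most of the missing values.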

In [10]:
nona_df = co2_df.dropna(axis=0)

In [11]:
nona_df

Unnamed: 0,ID_LAT_LON_YEAR_WEEK,latitude,longitude,year,week_no,SulphurDioxide_SO2_column_number_density,SulphurDioxide_SO2_column_number_density_amf,SulphurDioxide_SO2_slant_column_number_density,SulphurDioxide_cloud_fraction,SulphurDioxide_sensor_azimuth_angle,...,Cloud_cloud_top_height,Cloud_cloud_base_pressure,Cloud_cloud_base_height,Cloud_cloud_optical_depth,Cloud_surface_albedo,Cloud_sensor_azimuth_angle,Cloud_sensor_zenith_angle,Cloud_solar_azimuth_angle,Cloud_solar_zenith_angle,emission
155,ID_-0.510_29.290_2021_49,-0.510,29.290,2021,49,0.000024,0.895098,0.000040,0.073222,-42.108062,...,5019.467594,66195.733910,4045.293055,29.389584,0.267237,-64.947208,44.180872,-131.574370,35.591652,4.687898
451,ID_-0.547_29.653_2021_27,-0.547,29.653,2021,27,0.000084,0.724555,0.000090,0.158538,4.135166,...,3786.135092,73589.759158,3008.477774,13.889302,0.167235,-12.612077,37.755641,-39.070211,31.531712,0.637903
453,ID_-0.547_29.653_2021_29,-0.547,29.653,2021,29,-0.000065,0.692398,-0.000034,0.097820,4.431898,...,3227.241991,77346.158934,2227.241986,9.526436,0.174665,-42.409148,48.340040,-41.646779,30.232400,0.627023
474,ID_-0.547_29.653_2021_50,-0.547,29.653,2021,50,-0.000215,0.738936,-0.000155,0.079641,74.048912,...,6069.286554,56446.041636,5069.286504,6.718726,0.221424,26.161072,35.887857,-138.458953,31.743673,0.618269
1112,ID_-0.615_30.885_2021_52,-0.615,30.885,2021,52,0.000195,0.816164,0.000159,0.041291,74.313675,...,4901.979810,65507.318464,3901.979885,18.925466,0.252066,-11.962600,41.940240,-137.305986,32.674957,84.161446
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
78045,ID_-3.136_30.364_2021_29,-3.136,30.364,2021,29,-0.000041,0.560384,-0.000017,0.028559,4.671381,...,3276.807608,75745.985749,2397.274872,6.929006,0.189645,-30.680727,53.829194,-38.467197,32.124882,15.915551
78203,ID_-3.138_30.662_2021_28,-3.138,30.662,2021,28,-0.000296,0.582466,-0.000162,0.083189,74.629140,...,4787.405410,66259.879395,3789.424594,6.820494,0.179993,-41.721972,36.694905,-41.679800,33.989014,22.860306
78205,ID_-3.138_30.662_2021_30,-3.138,30.662,2021,30,-0.000177,0.636178,-0.000094,0.085995,-50.365193,...,3448.748725,74489.621613,2596.923233,5.897749,0.177577,-13.445917,47.642659,-38.895209,29.898948,22.815990
78216,ID_-3.138_30.662_2021_41,-3.138,30.662,2021,41,0.000056,0.694221,0.000039,0.000000,75.047003,...,6428.729028,54486.457256,5428.729056,9.640017,0.233593,-13.485498,43.845330,-102.903782,25.609043,25.353760


- Almost every row contains at least one missing value: dropna(axis=0) keeps only 438 of the 79,023 rows.
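
Because dropping every incomplete row discards most of the data, an alternative worth considering is to drop only the sparsest columns and fill the remaining gaps. The sketch below is a minimal illustration, not the method used in this notebook: `drop_sparse_then_impute` is a hypothetical helper, the 0.5 threshold is an arbitrary assumption, and `demo` is synthetic.

```python
import numpy as np
import pandas as pd

def drop_sparse_then_impute(df, max_missing=0.5):
    """Drop columns whose NaN fraction exceeds max_missing, then median-fill the rest."""
    keep = df.columns[df.isna().mean() <= max_missing]
    kept = df[keep]
    return kept.fillna(kept.median(numeric_only=True))

# Synthetic example (values are made up for illustration).
demo = pd.DataFrame({
    "week_no": [0, 1, 2, 3],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],  # 75% NaN, so it is dropped
    "some_missing": [0.2, np.nan, 0.4, 0.6],          # 25% NaN, so it is median-filled
})
cleaned = drop_sparse_then_impute(demo)
print(cleaned)
```

In a modeling workflow the medians would normally be computed on the training rows only and then reused for validation rows, to avoid leaking information.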

In [12]:
# .copy() makes basic_df independent of co2_df, so adding or dropping columns later is safe
basic_df = co2_df[["latitude", "longitude", "year", "week_no", "emission"]].copy()

In [21]:
basic_df

Unnamed: 0,latitude,longitude,year,week_no,emission
0,-0.510,29.290,2019,0,3.750994
1,-0.510,29.290,2019,1,4.025176
2,-0.510,29.290,2019,2,4.231381
3,-0.510,29.290,2019,3,4.305286
4,-0.510,29.290,2019,4,4.347317
...,...,...,...,...,...
79018,-3.299,30.301,2021,48,29.404171
79019,-3.299,30.301,2021,49,29.186497
79020,-3.299,30.301,2021,50,29.131205
79021,-3.299,30.301,2021,51,28.125792


In [14]:
basic_df.isna().sum()

latitude     0
longitude    0
year         0
week_no      0
emission     0
dtype: int64

### KMeans

In [16]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

X_features = basic_df.values
X_features_scaled = StandardScaler().fit_transform(X_features)

kmeans = KMeans(n_clusters=3, random_state=0)
labels = kmeans.fit_predict(X_features_scaled)
basic_df['cluster_label']=labels

print('Silhouette score: {0:.3f}'.format(silhouette_score(X_features_scaled, labels)))

Silhouette score: 0.188
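
A silhouette score of 0.188 for n_clusters=3 suggests weakly separated clusters, so it may help to compare several cluster counts before settling on one. The sketch below shows that comparison on synthetic blobs; the data, the range 2 to 7, and n_init=10 are assumptions for illustration and say nothing about the best k for basic_df.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the scaled feature matrix.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Fit KMeans for each candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```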


In [20]:
basic_df.drop(["cluster_label"], axis=1, inplace=True)

### KNeighborsRegressor

In [26]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [38]:
data = nona_df.drop(["ID_LAT_LON_YEAR_WEEK", "emission"], axis=1)
target = nona_df["emission"]

In [39]:
target.describe()

count    438.000000
mean      72.816951
std       85.516149
min        0.000000
25%        9.584258
50%       46.021778
75%      102.831272
max      789.720800
Name: emission, dtype: float64

In [40]:
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.25)

In [41]:
X_train

Unnamed: 0,latitude,longitude,year,week_no,SulphurDioxide_SO2_column_number_density,SulphurDioxide_SO2_column_number_density_amf,SulphurDioxide_SO2_slant_column_number_density,SulphurDioxide_cloud_fraction,SulphurDioxide_sensor_azimuth_angle,SulphurDioxide_sensor_zenith_angle,...,Cloud_cloud_top_pressure,Cloud_cloud_top_height,Cloud_cloud_base_pressure,Cloud_cloud_base_height,Cloud_cloud_optical_depth,Cloud_surface_albedo,Cloud_sensor_azimuth_angle,Cloud_sensor_zenith_angle,Cloud_solar_azimuth_angle,Cloud_solar_zenith_angle
71054,-2.842,28.958,2021,34,0.000028,0.661135,0.000021,0.046731,24.688312,38.385133,...,44652.600575,6820.649078,50974.961242,5820.649087,6.545612,0.201492,74.622744,34.870441,-50.126023,22.441532
155,-0.510,29.290,2021,49,0.000024,0.895098,0.000040,0.073222,-42.108062,26.770401,...,58866.697087,5019.467594,66195.733910,4045.293055,29.389584,0.267237,-64.947208,44.180872,-131.574370,35.591652
30981,-1.617,30.683,2021,29,0.000074,0.604304,0.000016,0.051609,16.563089,47.199117,...,59158.122340,4561.906744,66912.725513,3597.661064,5.760876,0.222739,30.236646,40.014546,-34.619959,27.864327
30837,-1.615,30.285,2021,44,-0.000104,0.761957,-0.000082,0.058287,15.715968,29.873582,...,56408.439522,5130.416212,63293.120905,4177.367227,41.580721,0.353132,4.710682,50.069595,-120.014383,29.542543
4429,-0.843,30.657,2021,30,-0.000053,0.669824,-0.000052,0.055806,-65.338907,29.843840,...,60025.003471,4207.542472,68304.924614,3207.542461,42.894350,0.219816,-12.393693,59.781333,-44.313844,30.626284
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49107,-2.162,30.138,2021,29,-0.000122,0.664142,-0.000077,0.068883,30.984506,42.170172,...,72777.538563,2726.083966,79966.614000,1965.188133,14.801180,0.229997,-13.789429,53.616472,-35.921122,29.936195
66291,-2.726,30.474,2021,41,-0.000125,0.907058,-0.000113,0.000000,74.007141,42.491909,...,51369.339813,5780.566158,58123.059695,4780.566131,12.910227,0.231751,-1.260198,44.845091,-104.491534,24.339183
45768,-2.056,30.544,2021,29,-0.000270,0.611186,-0.000142,0.110076,16.351636,48.355742,...,60960.760417,4090.741211,69343.736979,3090.741292,7.689898,0.233114,14.438571,52.024240,-30.810007,27.586115
43543,-1.978,31.122,2021,30,0.000060,0.589369,0.000038,0.018969,-25.775522,29.660557,...,67462.159843,3335.737612,76092.561460,2379.321396,7.061171,0.149369,-65.161878,42.922740,-47.401615,32.059897


In [42]:
y_train

71054     87.905190
155        4.687898
30981    243.306550
30837    116.396220
4429     104.954636
            ...    
49107     90.771730
66291      9.601950
45768    112.294914
43543    206.025380
34184     81.117775
Name: emission, Length: 328, dtype: float64
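
The dataset description says the competition data is split by time, while train_test_split shuffles rows at random. A year-based holdout is one way to mirror that setup. The sketch below is a minimal illustration: `split_by_year` is a hypothetical helper and `demo` is synthetic data, not co2_df. The nona_df rows shown above are all from 2021, so this kind of split would only make sense on a frame that still covers several years.

```python
import numpy as np
import pandas as pd

def split_by_year(df, holdout_year):
    """Return (train, valid): rows before holdout_year, and rows in holdout_year."""
    train = df[df["year"] < holdout_year]
    valid = df[df["year"] == holdout_year]
    return train, valid

# Synthetic stand-in: 3 locations x 3 years x 4 weeks (values are made up).
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    [(loc, year, week)
     for loc in range(3)
     for year in (2019, 2020, 2021)
     for week in range(4)],
    columns=["location", "year", "week_no"],
)
demo["emission"] = rng.gamma(shape=2.0, scale=10.0, size=len(demo))

train_df, valid_df = split_by_year(demo, holdout_year=2021)
print(len(train_df), len(valid_df))
```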

In [45]:
from math import sqrt

rmse_val = []  # test-set RMSE for each K from 1 to 30
for K in range(1, 31):
    model = KNeighborsRegressor(n_neighbors=K)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    error = sqrt(mean_squared_error(y_test, pred))
    rmse_val.append(error)
    print('RMSE value k', K, '=', error)
print('Minimum RMSE', min(rmse_val))

RMSE value k 1 = 138.2823942885198
RMSE value k 2 = 114.15028261771425
RMSE value k 3 = 112.62587655528391
RMSE value k 4 = 108.72788754904239
RMSE value k 5 = 107.25381651745622
RMSE value k 6 = 105.61827324143394
RMSE value k 7 = 104.9064869723858
RMSE value k 8 = 106.53340571072565
RMSE value k 9 = 105.78724205283893
RMSE value k 10 = 105.93426772375358
RMSE value k 11 = 105.06114346641701
RMSE value k 12 = 104.61027152106479
RMSE value k 13 = 104.58401187251529
RMSE value k 14 = 104.16069248929219
RMSE value k 15 = 104.60756230638982
RMSE value k 16 = 104.21058759573148
RMSE value k 17 = 104.40618181037964
RMSE value k 18 = 104.07861290101813
RMSE value k 19 = 103.16323340704325
RMSE value k 20 = 102.99310555380012
RMSE value k 21 = 102.58894250556583
RMSE value k 22 = 102.65760949202617
RMSE value k 23 = 102.89820897322775
RMSE value k 24 = 103.07010019734517
RMSE value k 25 = 103.26008253789266
RMSE value k 26 = 103.02859594460564
RMSE value k 27 = 103.28439636719715
RMSE value k
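
KNN is distance-based, and the features here sit on very different scales (cloud pressures in the tens of thousands next to column densities around 1e-4), so unscaled distances are dominated by the large-valued columns. A possible variant is to standardize inside a pipeline and choose K by cross-validation instead of a single random split. The sketch below uses synthetic data and is not a rerun of the experiment above; the feature scales and the target formula are assumptions chosen to mimic that mismatch.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (values are made up).
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(0, 1e-4, 200),       # small-scale feature, like a column density
    rng.normal(60000, 8000, 200),   # large-scale feature, like a cloud pressure
])
y = 1e5 * X[:, 0] + rng.normal(0, 1, 200)  # target depends on the small-scale feature

# Scaling happens inside each CV fold, so validation rows never influence the scaler.
pipe = make_pipeline(StandardScaler(), KNeighborsRegressor())
search = GridSearchCV(
    pipe,
    param_grid={"kneighborsregressor__n_neighbors": list(range(1, 31))},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```

With the real data, X and y would be `data` and `target`; whether scaling lowers the RMSE there is something to check rather than assume.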

### OLS

In [49]:
pd.set_option('display.max_rows', 75)

In [67]:
from sklearn.preprocessing import StandardScaler

scaled_data = StandardScaler().fit_transform(data)
scaled_data = pd.DataFrame(scaled_data, columns=data.columns)

In [68]:
scaled_data

Unnamed: 0,latitude,longitude,year,week_no,SulphurDioxide_SO2_column_number_density,SulphurDioxide_SO2_column_number_density_amf,SulphurDioxide_SO2_slant_column_number_density,SulphurDioxide_cloud_fraction,SulphurDioxide_sensor_azimuth_angle,SulphurDioxide_sensor_zenith_angle,...,Cloud_cloud_top_pressure,Cloud_cloud_top_height,Cloud_cloud_base_pressure,Cloud_cloud_base_height,Cloud_cloud_optical_depth,Cloud_surface_albedo,Cloud_sensor_azimuth_angle,Cloud_sensor_zenith_angle,Cloud_solar_azimuth_angle,Cloud_solar_zenith_angle
0,2.019619,-1.355190,0.0,1.595120,0.056311,1.398519,0.224916,-0.221131,-1.163257,-0.689251,...,0.359028,-0.070371,0.347994,-0.080801,1.409132,1.038860,-1.602425,0.282990,-1.484496,1.986390
1,1.963515,-0.893923,0.0,-1.079885,0.343632,0.048334,0.580361,1.299509,-0.272336,0.095275,...,1.423005,-1.017659,1.210253,-0.890244,0.013016,-1.392428,-0.069755,-0.510113,0.863193,0.615095
2,1.963515,-0.893923,0.0,-0.836702,-0.370540,-0.206253,-0.300907,0.217279,-0.266620,0.357263,...,1.547349,-1.446929,1.648308,-1.500155,-0.379949,-1.211796,-0.942382,0.796379,0.797801,0.176237
3,1.963515,-0.893923,0.0,1.716711,-1.088963,0.162184,-1.163004,-0.106721,1.074620,0.249896,...,-0.785720,0.735965,-0.788972,0.718632,-0.632840,-0.074955,1.065744,-0.740664,-1.659222,0.686688
4,1.860405,0.671590,0.0,1.959893,0.878361,0.773601,1.074080,-0.790269,1.079721,0.189786,...,0.284736,-0.160610,0.267714,-0.192685,0.466625,0.670028,-0.050734,0.006416,-1.629961,1.001241
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
433,-1.962248,0.009551,0.0,-0.836702,-0.252537,-1.251403,-0.180366,-1.017192,-0.262006,0.378138,...,1.497184,-1.408860,1.461703,-1.367410,-0.613900,-0.847593,-0.598908,1.473937,0.878497,0.815446
434,-1.965281,0.388222,0.0,-0.958293,-1.475008,-1.076584,-1.207921,-0.043499,1.085798,-0.019426,...,0.281302,-0.248612,0.355474,-0.280558,-0.623674,-1.082267,-0.922258,-0.641046,0.796963,1.445079
435,-1.965281,0.388222,0.0,-0.715111,-0.906661,-0.651343,-0.731431,0.006527,-1.322339,0.229133,...,1.402906,-1.276796,1.315191,-1.211545,-0.706786,-1.141006,-0.094174,0.710298,0.867634,0.063609
436,-1.965281,0.388222,0.0,0.622391,0.212817,-0.191823,0.218528,-1.526218,1.093849,-0.369914,...,-1.020121,1.012043,-1.017491,0.999249,-0.369718,0.220897,-0.095333,0.241572,-0.756857,-1.385359


In [69]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
vif['VIF factor'] = [variance_inflation_factor(scaled_data, i) for i in range(scaled_data.shape[1])]
vif['features'] = scaled_data.columns
vif = vif.sort_values('VIF factor').reset_index(drop=True)
vif

Unnamed: 0,VIF factor,features
0,1.857581,Ozone_O3_effective_temperature
1,2.272506,longitude
2,2.430183,NitrogenDioxide_cloud_fraction
3,2.592181,Cloud_cloud_optical_depth
4,3.007489,NitrogenDioxide_tropopause_pressure
5,3.086085,Cloud_surface_albedo
6,3.290286,Cloud_sensor_zenith_angle
7,3.305531,CarbonMonoxide_cloud_height
8,3.569196,SulphurDioxide_cloud_fraction
9,3.627595,CarbonMonoxide_CO_column_number_density


In [70]:
new_data = scaled_data[["latitude", "longitude", "week_no", "Ozone_O3_effective_temperature", "NitrogenDioxide_cloud_fraction", "Cloud_cloud_optical_depth", "CarbonMonoxide_cloud_height", "SulphurDioxide_cloud_fraction"]]

In [171]:
new_data_col = ['latitude', 'longitude', 'year', 'week_no',
       'SulphurDioxide_SO2_column_number_density',
       'SulphurDioxide_SO2_column_number_density_amf',
       'SulphurDioxide_cloud_fraction', 'SulphurDioxide_sensor_azimuth_angle',
       'SulphurDioxide_sensor_zenith_angle',
       'SulphurDioxide_solar_azimuth_angle',
       'SulphurDioxide_solar_zenith_angle',
       'SulphurDioxide_SO2_column_number_density_15km',
       'CarbonMonoxide_CO_column_number_density',
       'CarbonMonoxide_H2O_column_number_density',
       'CarbonMonoxide_cloud_height', 'CarbonMonoxide_sensor_altitude',
       'CarbonMonoxide_sensor_azimuth_angle',
       'CarbonMonoxide_solar_azimuth_angle',
       'CarbonMonoxide_solar_zenith_angle',
       'NitrogenDioxide_NO2_column_number_density',
       'NitrogenDioxide_tropospheric_NO2_column_number_density',
       'NitrogenDioxide_stratospheric_NO2_column_number_density',
       'NitrogenDioxide_NO2_slant_column_number_density',
       'NitrogenDioxide_tropopause_pressure',
       'NitrogenDioxide_absorbing_aerosol_index',
       'NitrogenDioxide_cloud_fraction', 'NitrogenDioxide_sensor_altitude',
       'NitrogenDioxide_sensor_azimuth_angle',
       'NitrogenDioxide_sensor_zenith_angle',
       'Formaldehyde_tropospheric_HCHO_column_number_density',
       'Formaldehyde_tropospheric_HCHO_column_number_density_amf',
       'Formaldehyde_HCHO_slant_column_number_density',
       'Formaldehyde_cloud_fraction', 
       'Formaldehyde_solar_azimuth_angle', 'Formaldehyde_sensor_zenith_angle',
       'Formaldehyde_sensor_azimuth_angle',
       'UvAerosolIndex_absorbing_aerosol_index',
       'UvAerosolIndex_sensor_altitude', 'UvAerosolIndex_sensor_azimuth_angle',
       'UvAerosolIndex_sensor_zenith_angle',
       'UvAerosolIndex_solar_azimuth_angle',
       'UvAerosolIndex_solar_zenith_angle', 'Ozone_O3_column_number_density',
       'Ozone_O3_column_number_density_amf',
       'Ozone_O3_slant_column_number_density',
       'Ozone_O3_effective_temperature', 'Ozone_cloud_fraction',
       'Ozone_sensor_azimuth_angle', 'Ozone_sensor_zenith_angle',
       'Ozone_solar_azimuth_angle', 
       'UvAerosolLayerHeight_aerosol_height',
       'UvAerosolLayerHeight_aerosol_pressure',
       'UvAerosolLayerHeight_aerosol_optical_depth',
       'UvAerosolLayerHeight_sensor_zenith_angle',
       'UvAerosolLayerHeight_sensor_azimuth_angle',
       'Cloud_cloud_top_pressure', 'Cloud_cloud_top_height',
       'Cloud_cloud_base_pressure', 'Cloud_cloud_base_height',
       'Cloud_cloud_optical_depth', 
       'Cloud_sensor_azimuth_angle', 'Cloud_sensor_zenith_angle',
       'Cloud_solar_zenith_angle']

In [93]:
scaled_df = scaled_data.copy()
scaled_df["emission"] = nona_df["emission"].reset_index(drop=True)

In [173]:
import statsmodels.api as sm

model=sm.OLS.from_formula('emission~'+'+'.join(new_data_col),data=scaled_df)
res = model.fit()
res.summary()

0,1,2,3
Dep. Variable:,emission,R-squared:,0.3
Model:,OLS,Adj. R-squared:,0.184
Method:,Least Squares,F-statistic:,2.586
Date:,"Sun, 10 Sep 2023",Prob (F-statistic):,1.95e-08
Time:,00:01:57,Log-Likelihood:,-2491.6
No. Observations:,438,AIC:,5109.0
Df Residuals:,375,BIC:,5366.0
Df Model:,62,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,72.8170,3.692,19.724,0.000,65.558,80.076
latitude,7.9793,14.830,0.538,0.591,-21.182,37.140
longitude,-4.3534,5.419,-0.803,0.422,-15.009,6.302
year,-3.992e-11,3.51e-11,-1.137,0.256,-1.09e-10,2.91e-11
week_no,-52.3220,51.888,-1.008,0.314,-154.351,49.706
SulphurDioxide_SO2_column_number_density,37.4440,14.842,2.523,0.012,8.261,66.627
SulphurDioxide_SO2_column_number_density_amf,-16.3817,8.289,-1.976,0.049,-32.681,-0.082
SulphurDioxide_cloud_fraction,12.1912,6.886,1.770,0.077,-1.349,25.732
SulphurDioxide_sensor_azimuth_angle,34.1870,19.164,1.784,0.075,-3.495,71.869

0,1,2,3
Omnibus:,174.873,Durbin-Watson:,1.499
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1031.047
Skew:,1.612,Prob(JB):,1.29e-224
Kurtosis:,9.79,Cond. No.,9800000000000000.0


In [177]:
res.predict(scaled_df[new_data_col])

0      47.157634
1      30.246181
2       6.344636
3      89.039956
4      79.577840
         ...    
433    18.645250
434    24.040887
435    44.639369
436    92.939469
437    34.047228
Length: 438, dtype: float64

In [178]:
scaled_df["emission"]

0       4.687898
1       0.637903
2       0.627023
3       0.618269
4      84.161446
         ...    
433    15.915551
434    22.860306
435    22.815990
436    25.353760
437    46.455006
Name: emission, Length: 438, dtype: float64
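
The predictions and the actual emissions printed above can be compared with a single error figure. The sketch below computes RMSE from only the first five prediction/actual pairs shown in those outputs, so the result is illustrative and not the model's overall error. Both series also come from the rows the model was fitted on, which makes this an in-sample comparison.

```python
import numpy as np

# First five values copied from the two outputs above.
pred = np.array([47.157634, 30.246181, 6.344636, 89.039956, 79.577840])
actual = np.array([4.687898, 0.637903, 0.627023, 0.618269, 84.161446])

# Root mean squared error between predictions and actual values.
rmse = float(np.sqrt(np.mean((pred - actual) ** 2)))
print(round(rmse, 2))
```

On the full frame the same expression would take `res.predict(scaled_df[new_data_col])` and `scaled_df["emission"]` as inputs.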