# Energia Injetada (2021) - Collected Data Analysis
This is a Jupyter notebook intended to read, understand and validate the data from the file "energia_202109-202112.csv".

> **Goals**:
>
> - Create a data frame;
> - Create simpler labels;
> - Validate values.


##### We'll use "pandas" to read the **CSV** file and create a data frame from it.

In [1]:
import pandas as pd

energia_2021 = pd.read_csv("energia_202109-202112.csv", encoding="IBM860")  # IBM860 is a Portuguese code page.

##### Now we want to inspect the feature labels:

In [2]:
energia_2021.columns

Index(['Data', 'Hora', 'Normal (kWh)', 'Horßrio Econ≤mico (kWh)',
       'Autoconsumo (kWh)', 'Injeτπo na rede (kWh)'],
      dtype='object')

##### The labels with accented characters come out garbled, which suggests that 'IBM860' is not the encoding the file was actually written in; the garbled characters are consistent with Latin-1 bytes decoded as IBM860.
##### Rather than keep these labels, we will rename them to simpler English ones.

In [3]:
# Rename the two garbled labels to simpler English ones.
energia_2021.rename(columns={"Horßrio Econ≤mico (kWh)": "Economic (kWh)"}, inplace=True)
energia_2021.rename(columns={"Injeτπo na rede (kWh)": "Injection (kWh)"}, inplace=True)
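
The garbled labels can be reproduced without the file. The sketch below assumes (this is a guess from the output, not something the source confirms) that the original labels were "Horário Económico (kWh)" and "Injeção na rede (kWh)" stored as Latin-1. Decoding those bytes as code page 860 gives the same garbled text that `energia_2021.columns` printed.

```python
# Assumed original labels (inferred from the garbled output, not read from the file).
assumed_labels = ["Horário Económico (kWh)", "Injeção na rede (kWh)"]


def decode_as_cp860(label: str) -> str:
    """Encode a label as Latin-1, then decode the same bytes as code page 860."""
    return label.encode("latin-1").decode("cp860")


garbled = [decode_as_cp860(label) for label in assumed_labels]
print(garbled)
```

If the assumption holds, reading the file with `encoding="latin-1"` would probably return readable labels, and the renames would then have to target those labels instead.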

##### Once more we shall check the results as above:

In [4]:
energia_2021.columns

Index(['Data', 'Hora', 'Normal (kWh)', 'Economic (kWh)', 'Autoconsumo (kWh)',
       'Injection (kWh)'],
      dtype='object')

##### The labels are now good enough to work with.
##### Next we need to check the overall content of each feature.

In [5]:
energia_2021.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2256 entries, 0 to 2255
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Data               2256 non-null   object 
 1   Hora               2256 non-null   int64  
 2   Normal (kWh)       2256 non-null   float64
 3   Economic (kWh)     2256 non-null   float64
 4   Autoconsumo (kWh)  2256 non-null   float64
 5   Injection (kWh)    566 non-null    object 
dtypes: float64(3), int64(1), object(2)
memory usage: 105.9+ KB


##### This summary raises doubts about the values of feature **No.5**, "Injection (kWh)": it has only 566 non-null entries out of 2256 and an `object` dtype.

In [6]:
energia_2021.head()

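|   | Data | Hora | Normal (kWh) | Economic (kWh) | Autoconsumo (kWh) | Injection (kWh) |
|---|---|---|---|---|---|---|
| 0 | 2021-09-29 | 0 | 0.0 | 0.0 | 0.0 | NaN |
| 1 | 2021-09-29 | 1 | 0.0 | 0.0 | 0.0 | NaN |
| 2 | 2021-09-29 | 2 | 0.0 | 0.0 | 0.0 | NaN |
| 3 | 2021-09-29 | 3 | 0.0 | 0.0 | 0.0 | NaN |
| 4 | 2021-09-29 | 4 | 0.0 | 0.0 | 0.0 | NaN |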


##### Luckily we were able to spot the problem.
##### Nonetheless, we inspect the values of the feature under investigation further.

In [7]:
# Collect the distinct values of the feature.
values = set(energia_2021["Injection (kWh)"])
print(values)

{'Medium', nan, 'High', 'Low', 'Very High'}
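
The same inspection can be sketched more compactly with `value_counts`, which also reports how often each value occurs. The `sample` Series below is a hypothetical stand-in for the real column, so the counts are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for energia_2021["Injection (kWh)"].
sample = pd.Series(["Low", np.nan, "High", np.nan, "Medium", "Very High", "Low"])

# Count every distinct value; dropna=False keeps the missing entries in the tally.
counts = sample.value_counts(dropna=False)
print(counts)

# Number of missing entries in the column.
n_missing = int(sample.isna().sum())
print(n_missing)
```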


The problem is that this categorical feature uses 'None' as one of its class values.\
When the file is read in ***python***, that value becomes a null object, which in this case is represented as **NaN** (Not a Number).

Henceforth we shall proceed with a subtle transformation of this categorical feature.\
Since its values have an **order**, we'll transform it into a numerical ordinal feature:
- ***None:*** 0
- ***Low:*** 1
- ***Medium:*** 2
- ***High:*** 3
- ***Very High:*** 4

In [8]:
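def no5_to_ordinal(columns):
    """Map the injection category of one row to its ordinal code."""
    injection = columns.iloc[0]  # the row holds a single value; access it by position

    if pd.isnull(injection):
        return 0  # missing value, i.e. the 'None' class
    elif injection == "Low":
        return 1
    elif injection == "Medium":
        return 2
    elif injection == "High":
        return 3
    elif injection == "Very High":
        return 4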

To check that the transformation was applied correctly, we'll store each value's category in a ***dictionary***, keyed by row position, before and after the transformation.

#### Before the transformation:
> ...
>
> 100: NaN
>
> 101: "Low"
>
> 102: "Medium"
>
> 103: "High"
>
> 104: "Very High"
>
> ...

In [9]:
# Store each value under its row position, before the transformation.
id_no5_categorical = dict(enumerate(energia_2021["Injection (kWh)"]))

#### Apply the transformation:

In [10]:
energia_2021["Injection (kWh)"] = energia_2021[['Injection (kWh)']].apply(no5_to_ordinal, axis=1)
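
The same mapping can also be sketched without a row-wise function, using `Series.map` with a dictionary. The `sample` Series and the `ORDER` name are hypothetical; the codes are the ones from the list above. One caveat of this variant is that any unexpected category would also end up as 0.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the "Injection (kWh)" column.
sample = pd.Series(["Low", np.nan, "High", "Medium", "Very High"])

# Ordinal codes for the named categories; missing values are handled separately.
ORDER = {"Low": 1, "Medium": 2, "High": 3, "Very High": 4}

# map() leaves NaN where there is no match, so fill those with 0 and cast to int.
ordinal = sample.map(ORDER).fillna(0).astype(int)
print(ordinal.tolist())
```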

#### After the transformation:
> ...
>
> 100: 0
>
> 101: 1
>
> 102: 2
>
> 103: 3
>
> 104: 4
>
> ...

In [11]:
# Store each value under its row position, after the transformation.
id_no5_ordinal = dict(enumerate(energia_2021["Injection (kWh)"]))

#### Check the correctness of the transformation:

In [12]:
error = 0  # 0: no error, 1: wrong transformation, 2: data loss
if len(id_no5_categorical) == len(id_no5_ordinal):
    for index in range(len(id_no5_ordinal)):
        if str(id_no5_categorical[index]) == 'nan' and id_no5_ordinal[index] == 0:
            pass
        elif str(id_no5_categorical[index]) == "Low" and id_no5_ordinal[index] == 1:
            pass
        elif str(id_no5_categorical[index]) == "Medium" and id_no5_ordinal[index] == 2:
            pass
        elif str(id_no5_categorical[index]) == "High" and id_no5_ordinal[index] == 3:
            pass
        elif str(id_no5_categorical[index]) == "Very High" and id_no5_ordinal[index] == 4:
            pass
        else:
            error = 1  # wrong transformation
            print("Error: transformation went wrong!")
            break
else:
    error = 2  # data loss
    print("Error: data structure sizes do not match!")

# Verbose
print(error)

0
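
As a complementary check that does not repeat the mapping itself, a cross-tabulation of the values before and after can show whether each category ended up under exactly one code. This is only a sketch on hypothetical `before` and `after` Series, not the check used above.

```python
import numpy as np
import pandas as pd

# Hypothetical values before and after the transformation.
before = pd.Series(["Low", np.nan, "Very High", "Low", np.nan])
after = pd.Series([1, 0, 4, 1, 0])

# Label missing values so they appear as their own row in the table.
table = pd.crosstab(before.fillna("missing"), after)
print(table)

# Each category should map to exactly one code, and each code to one category.
nonzero = table > 0
one_to_one = bool((nonzero.sum(axis=1) == 1).all() and (nonzero.sum(axis=0) == 1).all())
print(one_to_one)
```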


#### We check the data frame once more:

In [13]:
energia_2021.info()
energia_2021.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2256 entries, 0 to 2255
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Data               2256 non-null   object 
 1   Hora               2256 non-null   int64  
 2   Normal (kWh)       2256 non-null   float64
 3   Economic (kWh)     2256 non-null   float64
 4   Autoconsumo (kWh)  2256 non-null   float64
 5   Injection (kWh)    2256 non-null   int64  
dtypes: float64(3), int64(2), object(1)
memory usage: 105.9+ KB


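|   | Data | Hora | Normal (kWh) | Economic (kWh) | Autoconsumo (kWh) | Injection (kWh) |
|---|---|---|---|---|---|---|
| 0 | 2021-09-29 | 0 | 0.0 | 0.0 | 0.0 | 0 |
| 1 | 2021-09-29 | 1 | 0.0 | 0.0 | 0.0 | 0 |
| 2 | 2021-09-29 | 2 | 0.0 | 0.0 | 0.0 | 0 |
| 3 | 2021-09-29 | 3 | 0.0 | 0.0 | 0.0 | 0 |
| 4 | 2021-09-29 | 4 | 0.0 | 0.0 | 0.0 | 0 |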


### Conclusion

#### Since we will keep working with this dataset, we'll export the clean, ready-to-go data frame to a CSV file.

In [14]:
energia_2021.to_csv("energia_2021.csv")
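
Note that `to_csv` writes the row index as an unnamed first column by default, so reading the exported file back yields an extra `Unnamed: 0` column. The sketch below uses a small hypothetical `frame` and in-memory strings to show the difference that `index=False` makes; whether to keep the index is a choice for the later analysis.

```python
import io

import pandas as pd

# Hypothetical frame standing in for energia_2021.
frame = pd.DataFrame({"Hora": [0, 1], "Injection (kWh)": [0, 2]})

# Default behaviour: the index is written as an unnamed first column.
with_index = frame.to_csv()
# index=False leaves the index out of the file.
without_index = frame.to_csv(index=False)

columns_with_index = pd.read_csv(io.StringIO(with_index)).columns.tolist()
columns_without_index = pd.read_csv(io.StringIO(without_index)).columns.tolist()
print(columns_with_index)
print(columns_without_index)
```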