# Requirements
Basic knowledge of Python and Pandas is needed to follow this notebook. Check the subjects listed in these courses:

- [Python course](https://www.kaggle.com/learn/python)
- [Pandas course](https://www.kaggle.com/learn/pandas)
- Check out [this notebook](https://www.kaggle.com/ponybiam/introduction-to-ifcopenshell-functions) and [this one](https://www.kaggle.com/ponybiam/ifc-parsing-example) to get familiar with the `ifcopenshell` package.

# Load packages
First we are going to install the `ifcopenshell` package. IfcOpenShell is an open source software library that helps users and software developers to work with the IFC file format. The IFC file format can be used to describe building and construction data. The format is commonly used for Building Information Modelling (BIM).

Run the following code to install the package in the current environment:

In [None]:
conda install -c conda-forge -c oce -c dlr-sc -c ifcopenshell ifcopenshell

And now we import the packages we are going to use in this notebook:

In [None]:
import pandas as pd
import ifcopenshell
import random

# Load dataset
We are going to use the parsed IFC files we obtained in our [previous notebook](https://www.kaggle.com/ponybiam/ifc-parsing-example). We exported the file as `ifc_parsed_data.csv`:

In [None]:
data = pd.read_csv("/kaggle/input/example-ifc-file/ifc_parsed_data.csv")
data.head()

Now we are going to add some fake data; its only purpose is to give us a continuous variable to play around with some Pandas functions. Imagine we have a column named `price` with the cost of each element. We are going to create this column with a random value between 500 and 5000 (this, for sure, won't be realistic!) using the function `uniform` from the package `random` ([documentation here](https://docs.python.org/3/library/random.html#random.uniform)). Let's try it out:

In [None]:
random.uniform(500, 5000)

Try running it several times: you will obtain a different value each time. We don't need so many decimals for our dataset, so we can use the `round` function to keep only 2:

In [None]:
# A random number
number = random.uniform(500, 5000)
# Let's round it to 2 decimals
rounded_number = round(number, 2)
rounded_number

We can create a simple function that gets through both steps:

In [None]:
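def create_random_number(min_value, max_value):
    # Draw a random float between min_value and max_value and round it to 2 decimals
    # (the parameters are not called min/max to avoid shadowing Python's built-ins)
    return round(random.uniform(min_value, max_value), 2)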
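
Because `random.uniform` draws a new value on every call, the prices will change each time the notebook runs. If repeatable results are wanted, one option is to fix the seed with `random.seed`. The sketch below restates the function with illustrative parameter names so that it runs on its own; the seed value 42 is arbitrary.

```python
import random

def create_random_number(min_value, max_value):
    # Random float in the given range, rounded to 2 decimals
    return round(random.uniform(min_value, max_value), 2)

# Fixing the seed makes the sequence of "random" numbers repeatable
random.seed(42)
first_run = [create_random_number(500, 5000) for _ in range(3)]

random.seed(42)
second_run = [create_random_number(500, 5000) for _ in range(3)]

print(first_run == second_run)  # True
```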

And now we have to add the column to our dataset. We need to create a random value for each row in our dataset; a very useful Python feature can accomplish this task easily: [list comprehensions](https://www.kaggle.com/colinmorris/loops-and-list-comprehensions). For example, let's create a list with the double of each number from 0 to 10:

In [None]:
# First, we get the numbers from 0 to 10. This can be done with "range" (the last number is excluded and it starts at 0 by default)
my_numbers = range(11)
# And now we create the list with list comprehension
[2*number for number in my_numbers]
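
For comparison, here is a sketch of the same result written as an explicit `for` loop. The names `doubled_loop` and `doubled_comprehension` are only illustrative.

```python
my_numbers = range(11)

# Explicit loop: start with an empty list and append one value at a time
doubled_loop = []
for number in my_numbers:
    doubled_loop.append(2 * number)

# List comprehension: the same logic in a single expression
doubled_comprehension = [2 * number for number in my_numbers]

print(doubled_loop == doubled_comprehension)  # True
```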

This can be done with any function we want (instead of using `2*number`). In our case, we want to create a random number with the function `create_random_number` for each row of our dataset:

In [None]:
# First, we get the length of our dataset
last_number = len(data)
# Now we create the list to iterate over
my_numbers = range(last_number)
# And finally, we create the column in our dataset
data["price"] = [create_random_number(500, 5000) for i in my_numbers]

data.head()
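
For larger datasets, drawing all values in one call is usually faster than calling a Python function once per row. Here is a minimal sketch with NumPy, assuming a small stand-in dataframe named `toy` (hypothetical, not the notebook's `data`) and an arbitrary seed:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the notebook's `data` dataframe
toy = pd.DataFrame({"element_id": [1, 2, 3, 4]})

# A seeded generator makes the example repeatable; the seed value is arbitrary
rng = np.random.default_rng(seed=0)

# Draw one value per row in a single call, then round to 2 decimals
toy["price"] = rng.uniform(500, 5000, size=len(toy)).round(2)

print(toy)
```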

# Dataset exploration
- [Grouping](#Grouping)
- [Merging](#Merging)
- [Unique values](#Unique-values)
- Count values by building
- Sort by price
- Fill NaN in description using map
- Build new_id by concatenating strings

## Grouping
Now that we have the price of each element, how can we get the **mean price of each element type**? We can use the method `groupby`; always keep this method in mind because you are going to use it a lot.

You need two things to group your data:
- column (or columns) to group by (of course)
- an aggregation function

Let's start by counting how many elements of each `element_type` we have in our dataset:

In [None]:
data.groupby("element_type").count()

Here, as we didn't specify which column we want to count, we get the count of non-null values for every column in our dataset (and a pandas dataframe in return). Let's try choosing only one:

In [None]:
data.groupby("element_type")["element_id"].count()

And we get a pandas series in return. Let's calculate the mean price:

In [None]:
data.groupby("element_type")["price"].mean()
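
Several aggregation functions can also be computed at once with `agg`. The sketch below uses a small made-up dataframe (`toy`, with invented prices) in place of the notebook's `data`, so that it runs on its own:

```python
import pandas as pd

# Hypothetical data, only for illustration
toy = pd.DataFrame({
    "element_id": [1, 2, 3, 4, 5],
    "element_type": ["IfcWall", "IfcWall", "IfcDoor", "IfcDoor", "IfcSlab"],
    "price": [1000.0, 2000.0, 600.0, 800.0, 3000.0],
})

# One row per element type, one column per aggregation function
summary = toy.groupby("element_type")["price"].agg(["count", "mean", "min", "max"])

print(summary)
```

The result is a dataframe indexed by `element_type`, so a single value can be read with, for example, `summary.loc["IfcWall", "mean"]`.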

## Merging
Imagine we want to add the mean price to our dataset. We should add to each row the corresponding mean price (depending on the element type). For this purpose we can create a dataset with the mean prices and merge it with our original dataset. This is called *merging* or *joining*; I strongly suggest [checking out the documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html), as this is another method you will be using a lot.

First we create a dataset with our mean prices:

In [None]:
# First, we group and obtain the mean
mean_price_serie = data.groupby("element_type")["price"].mean()
# Then, we convert it into a dataframe
mean_price = pd.DataFrame(mean_price_serie)
# And we reset the index
mean_price.reset_index(inplace=True)
# We rename the column
mean_price.rename(columns={"price":"mean_price"}, inplace=True)
# Check it out
mean_price
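
The same steps can be chained into a single expression. This is only a sketch of an equivalent style, shown on a small made-up dataframe (`toy`) instead of the notebook's `data`:

```python
import pandas as pd

# Hypothetical data, only for illustration
toy = pd.DataFrame({
    "element_type": ["IfcWall", "IfcWall", "IfcDoor"],
    "price": [1000.0, 2000.0, 600.0],
})

mean_price = (
    toy.groupby("element_type")["price"]
    .mean()                 # Series indexed by element_type
    .rename("mean_price")   # rename the Series (this becomes the column name)
    .reset_index()          # turn the index back into a regular column
)

print(mean_price)
```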

And now it is time to merge. We will need:

- Two dataframes
- A common column, present in both dataframes, to merge over
- The type of join to use

The image below illustrates how two datasets can be joined ([image source](https://data36.com/pandas-tutorial-3-important-data-formatting-methods-merge-sort-reset_index-fillna/)):
![joining](https://data36.com/wp-content/uploads/2018/08/4-pandas-merge-inner-outer-left-right-768x579.png)
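
To see the four join types in action, here is a small sketch with two made-up dataframes (`left` and `right`, both hypothetical) that share a `key` column:

```python
import pandas as pd

# Two hypothetical dataframes that only partially overlap on "key"
left = pd.DataFrame({"key": ["a", "b", "c"], "left_value": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "right_value": [20, 30, 40]})

# Collect the keys that survive each type of join
keys_by_how = {}
for how in ["inner", "left", "right", "outer"]:
    merged = pd.merge(left, right, how=how, on="key")
    keys_by_how[how] = sorted(merged["key"])
    print(how, keys_by_how[how])

# inner ['b', 'c']
# left ['a', 'b', 'c']
# right ['b', 'c', 'd']
# outer ['a', 'b', 'c', 'd']
```

Rows without a match on the other side are kept only by the join types that preserve that side, and their missing values are filled with `NaN`.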

We want to add the `mean_price` column to our dataset. This can be accomplished with a left join, which keeps every row of the `data` dataframe and attaches the matching row of the `mean_price` dataframe:

In [None]:
data1 = pd.merge(data, mean_price, how="left", on="element_type")
data1.head()
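
When the only goal is to attach a group statistic to each row, `groupby` combined with `transform` offers a shortcut that skips the intermediate dataframe. This is a sketch of that alternative, not a replacement for learning `merge`, and it uses a small made-up dataframe (`toy`):

```python
import pandas as pd

# Hypothetical data, only for illustration
toy = pd.DataFrame({
    "element_type": ["IfcWall", "IfcWall", "IfcDoor"],
    "price": [1000.0, 2000.0, 600.0],
})

# transform returns one value per original row, aligned with the dataframe's index
toy["mean_price"] = toy.groupby("element_type")["price"].transform("mean")

print(toy)
```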

## Unique values
When you group by a column, you get the unique values of that column as the index. But what if we only need the unique values? We don't need to group the data in order to get them: we can use the method `unique`, which can be applied to any series. For example, let's find the unique values of the columns `element_type` and `bdg_name`:

In [None]:
# Get the unique values
unique_element_type = data1.element_type.unique()
unique_bdg_name = data1.bdg_name.unique()
# Print it
print(f"Unique element types: {unique_element_type}\n\nUnique building names: {unique_bdg_name}")
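
Two related methods are often useful alongside `unique`: `nunique` returns how many distinct values there are, and `value_counts` returns how often each one appears. Here is a minimal sketch on a made-up series (the `toy` dataframe is hypothetical):

```python
import pandas as pd

# Hypothetical data, only for illustration
toy = pd.DataFrame({
    "element_type": ["IfcWall", "IfcDoor", "IfcWall", "IfcSlab", "IfcWall"],
})

unique_values = toy["element_type"].unique()   # distinct values, in order of first appearance
n_unique = toy["element_type"].nunique()       # number of distinct values
counts = toy["element_type"].value_counts()    # occurrences of each value, most frequent first

print(unique_values)
print(n_unique)
print(counts)
```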