# Review module

**Instructions**

To complete this review module, follow these instructions:

1. Complete the functions provided to you in this notebook, but do **not** change the name of the function or the name(s) of the argument(s). If you do that, the autograder will fail and you will not receive any points.
2. Run all the function-definition cells before you run the testing cells. The functions must exist before they are graded!
3. Read the function docstrings carefully. They contain additional information about how the code should look (a [docstring](https://www.datacamp.com/community/tutorials/docstrings-python) is the text between the triple quotes at the top of a function).
4. Some functions may require several outputs (the docstrings tell you which ones). Make sure they are returned in the right order.

## The dataset

The dataset `data/chip_plants.csv` is a table of chip manufacturing plants taken from [Wikipedia](https://en.wikipedia.org/wiki/List_of_semiconductor_fabrication_plants) and stored in the [Wiki Markup](https://en.wikipedia.org/wiki/Help:Introduction_to_editing_with_Wiki_Markup/2) format. In this file, each cell starts with the character `|`, like this:

~~~plain
|[[Texas Instruments]]
~~~

Double square brackets (`[[]]`) represent links to Wikipedia pages. In the example above, the markup links to [this page](https://en.wikipedia.org/wiki/Texas_Instruments).

Each cell is on its own line, and table rows are demarcated by this code: `|-`. Thus,

~~~plain
|-
|[[Texas Instruments]] (formerly [[Semiconductor Manufacturing International Corporation|SMIC]] - Cension)
|Chengdu (CFAB)
| China {{flagicon|China}}, Chengdu
|
|
|200
|
|
|
|-
|[[Tsinghua Holdings|Tsinghua Unigroup]]<ref name="eetasia.com"/>
|
| China {{flagicon|China}}, Nanjing
|10 (first phase), 30
|Planned
|300
|
|100,000 (first phase)
|3D NAND Flash
|-
~~~

produces this:

![Table example](data/images/table_example.png)

Finally, `<ref name="example.com"/>` adds a footnote with a hyperlink (see the example in the second row above) and `{{flagicon|country_name}}` adds a country flag.
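
To make the row structure concrete, here is a minimal sketch in plain Python that groups cell lines into table rows using `|-` as the separator. The list `markup_lines` is a shortened toy excerpt typed in by hand, and `group_rows` is a hypothetical helper written only for this illustration; neither is part of the exercises.

```python
# Toy excerpt in the same markup as the dataset (typed by hand, not read from the file).
markup_lines = [
    "|-",
    "|[[Texas Instruments]]",
    "|Chengdu (CFAB)",
    "| China {{flagicon|China}}, Chengdu",
    "|-",
    '|[[Tsinghua Holdings|Tsinghua Unigroup]]<ref name="eetasia.com"/>',
    "|",
    "| China {{flagicon|China}}, Nanjing",
    "|-",
]


def group_rows(lines):
    """Group cell lines into table rows, treating `|-` as the row separator."""
    rows, current = [], []
    for line in lines:
        if line.strip() == "|-":
            # A separator closes the row collected so far (if there is one).
            if current:
                rows.append(current)
            current = []
        else:
            current.append(line)
    if current:
        rows.append(current)
    return rows


rows = group_rows(markup_lines)
print(len(rows))   # number of table rows found
print(rows[0][0])  # first cell of the first row
```

Each inner list holds the cells of one table row, in the order they appear in the markup.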

The file has two columns: `line` holds the raw Wikipedia cells and `id` identifies the chip manufacturing plant each cell belongs to.

In [None]:
import pandas as pd

chips = pd.read_csv("data/chip_plants.csv")
chips

### Exercise 1

Remove all the rows that contain the separator `|-`.

**Hint:** Some cells in the dataset have leading or trailing spaces, which can make otherwise identical values look different. You will need to remove the padding before dropping the `|-` cells.
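
The effect of padding can be seen on a small made-up example. The DataFrame `toy` below is an assumption for illustration only (it is not the real dataset); it shows how an exact comparison misses a padded separator, while comparing the stripped values finds it.

```python
import pandas as pd

# Made-up data: the third cell is a separator with padding around it.
toy = pd.DataFrame(
    {"id": [0, 0, 0, 1], "line": ["|-", "|Fab 1", " |- ", "|Fab 2 "]}
)

# Exact comparison only catches the unpadded separator.
exact_matches = (toy["line"] == "|-").sum()

# Stripping whitespace first catches both.
stripped_matches = (toy["line"].str.strip() == "|-").sum()

print(exact_matches, stripped_matches)
```

Note that `.str.strip()` is the pandas string accessor; it applies Python's `str.strip()` to every element of the column.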

In [None]:
def remove_separators(chips):
    """
    Removes all rows that consist of the separator `|-`
    
    Arguments:
    `chips`: A pandas DataFrame
    
    Outputs:
    `chips`: A pandas DataFrame (modified version of the input DataFrame)
    """
    
    # YOUR CODE HERE
    raise NotImplementedError() # Remove this line when you enter your solution
    
    return chips

### Exercise 2

Write a function that does the following to the `line` column (in this order):

1. Remove `|`s (note: this is the vertical bar character, not a letter).
2. Remove `[`s.
3. Remove `]`s.
4. Remove leading and trailing spaces.
5. Remove footnotes, flags, and other metadata; that is, everything that is between the `<` and `>` characters, and everything that is between the `{` and `}` characters.

We have provided you with the code to do the last task above. That code uses [regular expressions](https://www.w3schools.com/python/python_regex.asp), which you might not be familiar with yet. We will cover them in other cases, but for now think of them as a way of defining search patterns in text data.
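
As a rough illustration of what these search patterns do, the sketch below applies the same two patterns as the provided code to a made-up Series. The third string is invented to show one property worth knowing: `.*` is greedy, so it matches from the first `<` to the last `>` in a cell. The lazy variant `.*?` is shown only for comparison and is not what the provided code uses.

```python
import pandas as pd

# Made-up cells; the first two resemble the sample rows shown earlier.
cells = pd.Series(
    [
        'Tsinghua Unigroup<ref name="eetasia.com"/>',
        "China {{flagicon|China}}, Nanjing",
        "A<ref/> B <ref/> C",
    ]
)

# Same patterns as the provided code: drop <...> and {...} spans.
greedy = cells.str.replace(r"\<.*\>", "", regex=True)
greedy = greedy.str.replace(r"\{.*\}", "", regex=True)

# For comparison only: a lazy quantifier stops at the first closing `>`.
lazy = cells.str.replace(r"\<.*?\>", "", regex=True)

print(greedy.tolist())
print(lazy.tolist())
```

With the greedy pattern, the text between two tags in the same cell is removed along with the tags, which is worth keeping in mind when you inspect the cleaned output.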

In [None]:
def clean_cells(chips):
    """
    Removes unwanted characters from the `line` column after having
    removed the rows that consist of the separator `|-`
    
    Arguments:
    `chips`: A pandas DataFrame
    
    Outputs:
    `chips`: A pandas DataFrame (modified version of the input DataFrame)
    """
    chips = remove_separators(chips)
    
    # YOUR CODE HERE
    raise NotImplementedError() # Remove this line when you enter your solution
    
    chips_line_clean = chips_line_clean.str.replace(r"\<.*\>", "", regex=True)
    chips_line_clean = chips_line_clean.str.replace(r"\{.*\}", "", regex=True)
    
    chips = chips.assign(line=chips_line_clean)
    
    return chips

### Exercise 3

Each table row in the Wikipedia dataset has to have exactly 9 columns:

1. `company`
2. `plant_name`
3. `plant_location`
4. `plant_cost_us_billions`
5. `started_production`
6. `wafer_size`
7. `process_technology`
8. `production_capacity`
9. `technology_products`

Come up with a way to assign these columns to the cells so that they look like this (notice that the column name resets every 9 rows):

<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>id</th>      <th>line</th>      <th>columns</th>    </tr>  </thead>  <tbody>    <tr>      <th>0</th>      <td>0</td>      <td>United Microelectronics CorporationUMC - Hejian Technology CorporationHe Jian</td>      <td>company</td>    </tr>    <tr>      <th>1</th>      <td>0</td>      <td>Fab 8N</td>      <td>plant_name</td>    </tr>    <tr>      <th>2</th>      <td>0</td>      <td>China</td>      <td>plant_location</td>    </tr>    <tr>      <th>3</th>      <td>0</td>      <td>0.750, 1.2, +0.5</td>      <td>plant_cost_us_billions</td>    </tr>    <tr>      <th>4</th>      <td>0</td>      <td>2003, May</td>      <td>started_production</td>    </tr>    <tr>      <th>5</th>      <td>0</td>      <td>200</td>      <td>wafer_size</td>    </tr>    <tr>      <th>6</th>      <td>0</td>      <td>4000–1000, 500, 350, 250, 180, 110</td>      <td>process_technology</td>    </tr>    <tr>      <th>7</th>      <td>0</td>      <td>77,000</td>      <td>production_capacity</td>    </tr>    <tr>      <th>8</th>      <td>0</td>      <td>Foundry</td>      <td>technology_products</td>    </tr>    <tr>      <th>10</th>      <td>1</td>      <td>United Microelectronics CorporationUMC</td>      <td>company</td>    </tr>    <tr>      <th>11</th>      <td>1</td>      <td>Fab 6A</td>      <td>plant_name</td>    </tr>    <tr>      <th>12</th>      <td>1</td>      <td>Taiwan , Hsinchu</td>      <td>plant_location</td>    </tr>    <tr>      <th>13</th>      <td>1</td>      <td>0.35</td>      <td>plant_cost_us_billions</td>    </tr>    <tr>      <th>14</th>      <td>1</td>      <td>1989</td>      <td>started_production</td>    </tr>    <tr>      <th>15</th>      <td>1</td>      <td>150</td>      <td>wafer_size</td>    </tr>    <tr>      <th>16</th>      <td>1</td>      <td>450</td>      <td>process_technology</td>    </tr>    <tr>      <th>17</th>      <td>1</td>      <td>31,000</td>      
<td>production_capacity</td>    </tr>    <tr>      <th>18</th>      <td>1</td>      <td>Foundry</td>      <td>technology_products</td>    </tr>    <tr>      <th>20</th>      <td>2</td>      <td>United Microelectronics CorporationUMC</td>      <td>company</td>    </tr>    <tr>      <th>21</th>      <td>2</td>      <td>Fab 8AB</td>      <td>plant_name</td>    </tr>    <tr>      <th>22</th>      <td>2</td>      <td>Taiwan , Hsinchu</td>      <td>plant_location</td>    </tr>    <tr>      <th>23</th>      <td>2</td>      <td>1</td>      <td>plant_cost_us_billions</td>    </tr>    <tr>      <th>24</th>      <td>2</td>      <td>1995</td>      <td>started_production</td>    </tr>    <tr>      <th>25</th>      <td>2</td>      <td>200</td>      <td>wafer_size</td>    </tr>    <tr>      <th>...</th>      <td>...</td>      <td>...</td>      <td>...</td>    </tr>  </tbody></table>

The column you add has to be called `columns` (the tests will fail otherwise). <br>


<details>
    <summary markdown="span">
        <br>Click here for a <b>Hint</b>
    </summary>
    <blockquote>
        You can repeat a list <code>n</code> times by running <code>my_list * n</code>.
    </blockquote>
</details>
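
As a small illustration of the hint, the sketch below repeats a list and attaches it to a toy DataFrame as a new column. The names `labels` and `toy` are made up for this example, and it assumes the number of rows is an exact multiple of the list length.

```python
import pandas as pd

labels = ["a", "b", "c"]

# Multiplying a list by an integer repeats its elements in order.
repeated = labels * 2
print(repeated)

# The repeated list can be assigned as a column if its length matches the DataFrame.
toy = pd.DataFrame({"value": range(6)})
toy = toy.assign(label=labels * (len(toy) // len(labels)))
print(toy)
```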

In [None]:
def assign_column_names(chips):
    """
    Assigns column names to the `chips` DataFrame after having
    cleaned the `line` column.
    
    Arguments:
    `chips`: A pandas DataFrame
    
    Outputs:
    `chips`: A pandas DataFrame (a modified version of the original input
    with the added column `columns`)
    """
    
    chips = clean_cells(chips)
    
    # YOUR CODE HERE
    raise NotImplementedError() # Remove this line when you enter your solution
    
    return chips

### Exercise 4

Now pivot the data so that rows are individual chip manufacturing plants, and columns are the values in `columns`. In other words, make the data *wide*, just as it is in the [original Wikipedia page](https://en.wikipedia.org/wiki/List_of_semiconductor_fabrication_plants). Then, save it to the `data` folder with the name `clean_table.csv`.

When saving your CSV file, please export it *without* the `id` index column (i.e., set `index=False`).

<details>
    <summary markdown="span">
        <br>Click here for a <b>Hint</b>
    </summary>
    <blockquote>
        Use the <a href="https://pandas.pydata.org/docs/user_guide/reshaping.html#reshaping-by-pivoting-dataframe-objects"><code>.pivot()</code></a> method. This method lets you reshape a DataFrame by defining which variables should be treated as the index, the columns, and the values in the output DataFrame.
    </blockquote>
</details>
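
To see what pivoting from long to wide looks like, here is a minimal sketch on a made-up two-plant table. The company and plant names are invented for illustration, and the result is written to a string rather than to a file so that nothing in the `data` folder is touched.

```python
import pandas as pd

# Made-up long-format data: one row per cell, labelled by `columns`.
long_df = pd.DataFrame(
    {
        "id": [0, 0, 1, 1],
        "line": ["Acme", "Fab 1", "Globex", "Fab 2"],
        "columns": ["company", "plant_name", "company", "plant_name"],
    }
)

# One row per `id`, one column per distinct value in `columns`.
wide_df = long_df.pivot(index="id", columns="columns", values="line")
print(wide_df)

# With no path argument, to_csv returns the CSV text; index=False leaves out `id`.
csv_text = wide_df.to_csv(index=False)
print(csv_text)
```

After the pivot, `id` becomes the index of the wide table, which is why `index=False` is enough to keep it out of the exported file.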

In [None]:
def make_df_wide(chips):
    """
    Pivots the `chips` DataFrame and saves it as `clean_table.csv`,
    after having assigned a new column and having cleaned the data.
    
    Arguments:
    `chips`: A pandas DataFrame
    
    Outputs:
    No outputs.
    """
    chips = assign_column_names(chips)
    
    # YOUR CODE HERE
    raise NotImplementedError() # Remove this line when you enter your solution

## Testing Cells

Run the cells below to check your answers. Make sure you run your solution cells first; otherwise you will get a `NameError` when checking your answers.

In [None]:
# Ex 1
assert type(remove_separators(chips)) == type(pd.Series([62]).to_frame()), "Ex. 1 - Your output is not a DataFrame! Make sure you don't change the data type of the `chips` DataFrame."
assert len(remove_separators(chips)) == 4734, "Ex. 1 - Your output has too many / too few rows! Did you remove the leading/trailing spaces? You can do that using `.str.strip()`"
print("Exercise 1 looks correct!")

In [None]:
# Ex 2
assert type(clean_cells(chips)) == type(pd.Series([62]).to_frame()), "Ex. 2 - Your output is not a DataFrame! Make sure you don't change the data type of the `chips` DataFrame."
s = clean_cells(chips)["line"].str.contains("flagicon").sum() \
    + clean_cells(chips)["line"].str.contains("<ref").sum() \
    + clean_cells(chips)["line"].str.contains("|", regex=False).sum() \
    + clean_cells(chips)["line"].str.contains("[", regex=False).sum() \
    + clean_cells(chips)["line"].str.contains("]", regex=False).sum()
assert s == 0, "Ex. 2 - Your output still seems to contain one or more of the unwanted characters! Remember that you can use `.str.replace()` to replace characters! (You might want to set the `regex` argument to `False`.)"

print("Exercise 2 looks correct!")

In [None]:
# Ex 3
assert type(assign_column_names(chips)) == type(pd.Series([62]).to_frame()), "Ex. 3 - Your output is not a DataFrame! Make sure you don't change the data type of the `chips` DataFrame."
assert "columns" in assign_column_names(chips).columns, "Ex. 3 - Your DataFrame doesn't have a column called `columns`!"
m = assign_column_names(chips).groupby("columns")["id"].count().mean()
assert m == 526.0, "Ex. 3 - The Wikipedia table has 526 rows, but in your output one or more of the names in `columns` does not appear exactly 526 times. Check with `assign_column_names(chips).groupby('columns')['id'].count()`"
print("Exercise 3 looks correct!")

In [None]:
# Ex 4
make_df_wide(chips)
try:
    d = pd.read_csv("data/clean_table.csv")
except FileNotFoundError:
    print("Ex. 4 - The file `data/clean_table.csv` doesn't exist!")
    raise FileNotFoundError("Ex. 4 - The file `data/clean_table.csv` doesn't exist!")
assert d.shape == (526,9), "Ex. 4 - Your output should have 526 rows and 9 columns, but it doesn't. Did you pivot it? Hint: Use `.pivot()` instead of `.pivot_table()`. Also, did you forget to export the table without the plant id?"
assert set(d.columns) == set(['company', 'plant_cost_us_billions', 'plant_location', 'plant_name',
                              'process_technology', 'production_capacity', 'started_production',
                              'technology_products', 'wafer_size']), "Ex. 4 - Your DataFrame doesn't have all the required columns! Did you pivot it? Hint: Use `.pivot()` instead of `.pivot_table()`. Also, did you forget to export the table without the plant id?"
print("Exercise 4 looks correct!")

## Attribution

"List of semiconductor fabrication plants", 30 Apr 2021, Wikipedia, Creative Commons Attribution-ShareAlike License, https://en.wikipedia.org/wiki/List_of_semiconductor_fabrication_plants