# <center>Open Source Tools</center>

<img src="../image/tools_DALLE.jpeg" width=30% align="right" style="in-line">

>*Why reinvent the wheel when you can copy-paste someone else's,*
>
>*then spend hours figuring out why it doesn't work?*
>
>— ChatGPT 4o

<img src="../image/quote3_ChatGPT.png" width=60% align="left" style="in-line">

## Learning goals

1. Understand the role of open-source licenses.
2. Know where and how to find open-source packages for your projects.
3. Know how to quickly judge the quality and relevance of packages.

## Agenda

1. [Open-source licenses](#1)
2. [Finding and testing open-source tools](#2)
3. [Signing up for YOU-lead](#3)

<a name="1"></a>
## Agenda 1. Open-source licenses

An open source license is a legal agreement that outlines how software can be used, shared, and modified. Think of it as a set of rules provided by the software's creator that you agree to follow when you use their software.

Open-source licenses are important because they:

- **Protect the creators**: They ensure that the people who originally made the software get credit for their work.
- **Clarify users' rights**: They let the users know exactly what they are allowed to do with the software.
- **Encourage collaboration**: They make it easier for people to work together and build upon each other's work legally.

### Types of open-source licenses

- **Permissive Licenses**:
    - **MIT License**: Minimal restrictions on how the software can be used, allowing users to freely use, modify, and distribute the software.
    - **Apache License**: Similar to the MIT License but includes explicit patent rights.
    - **BSD License**: Similar to the MIT License. Minimal restrictions on redistribution. 
- **Copyleft Licenses**:
    - **GNU General Public License (GPL)**: Requires any modified versions of the software to be distributed under the same license terms.
    
Here is a great article explaining the most common licenses: [Top Open Source Licenses Explained](https://www.mend.io/blog/top-open-source-licenses-explained/).

<img src="../image/license_example.jpg" width=100% align="center" style="in-line">

Source: https://github.com/pandas-dev/pandas
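
If a package is already installed, its declared license can usually be read from the package metadata as well. The following is a minimal sketch using Python's standard `importlib.metadata`. The helper name `license_info` is made up for this example, and not every package fills in every field, so some values may come back empty.

```python
from importlib.metadata import metadata, PackageNotFoundError


def license_info(dist_name):
    """Return license-related metadata of an installed package, or None if it is not installed."""
    try:
        meta = metadata(dist_name)
    except PackageNotFoundError:
        return None
    # packages declare their license in different fields, so collect all of them
    classifiers = meta.get_all("Classifier") or []
    return {
        "license": meta.get("License"),
        "license_expression": meta.get("License-Expression"),
        "license_classifiers": [c for c in classifiers if c.startswith("License ::")],
    }


print(license_info("pandas"))  # prints None if pandas is not installed
```

The metadata only repeats what the package author declared, so the `LICENSE` file in the repository remains the authoritative source.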

### What open source licenses mean for us

- **Use software freely**: Nearly all open source software is also free software, although there are exceptions. You can read more [here](https://www.gnu.org/philosophy/open-source-misses-the-point.en.html).
- **Modify software to fit our needs**: Depending on the license, we may be able to adapt the software to better suit a specific research need.
- **Share our work legally**: If a research project involves distributing software or code, open source licenses provide a legal framework to do so.

&#x1F4A1; If you ever need to **license your work**, click [here](https://choosealicense.com) for help choosing a license.

<a name="2"></a>
## Agenda 2. Finding and testing open-source tools

I recommend two places for searching for open-source tools: GitHub and JOSS. 

**GitHub** is the largest source code host in the world (source: [Wikipedia](https://en.wikipedia.org/wiki/GitHub)). You can use the [search bar](https://github.com/search) &#x1F50D; with keywords or explore [trending repositories](https://github.com/trending).

**JOSS** is the short name for the [Journal of Open Source Software](https://joss.theoj.org). It is an open-access journal for ***research*** software packages. The tools published in JOSS are peer-reviewed, which means they have been rigorously evaluated. They usually meet standards of scientific quality, usability, and proper documentation. Most JOSS packages are hosted on GitHub.

### How to quickly judge the quality and relevance of packages

When assessing the quality and relevance of a tool hosted on GitHub, consider these key criteria:

1. **Stars and forks**: Indicators of popularity

- The number of stars reflects how many users have saved the repository as a favorite.

- The number of forks shows how many users have created their own copy of the repository, often to contribute or modify the code. A high number of forks may indicate a collaborative and evolving project with active contributors.

2. **Last commit date**: Recency of updates

- A recent commit date often signifies that the project is being actively developed or maintained.

3. **Issues and pull requests**: Active maintenance

- The ratio of open to closed issues can show how actively the maintainers resolve problems. A large number of unresolved issues might signal a lack of maintenance. 
- Frequent and recent pull requests indicate active contributions and community engagement. 
- Look at the types of open issues (e.g., bugs, enhancements). Too many critical or unresolved bugs might be a red flag, while enhancement-focused issues can indicate active feature development.

4. **`README`**: Quality of documentation

- A good `README` provides a clear overview of the package, including its purpose, main features, and usage. This can help determine if the project meets your needs.
- Step-by-step installation and usage instructions make it easier for new users to get started.
- Some projects link to additional documentation, such as usage examples or tutorial videos, which can be very helpful for understanding and using the tool effectively.
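
Most of these indicators can also be read programmatically. The sketch below assumes the public GitHub REST API endpoint `https://api.github.com/repos/{owner}/{repo}` and uses only the standard library. The function names and the `sample` values are made up for illustration, and unauthenticated requests to the API are rate-limited.

```python
import json
from datetime import datetime, timezone
from urllib.request import urlopen


def fetch_repo(owner, repo):
    """Download the repository metadata as a dict (requires internet access)."""
    with urlopen(f"https://api.github.com/repos/{owner}/{repo}") as response:
        return json.load(response)


def summarize_repo(info, now=None):
    """Pick out the quick quality indicators discussed above."""
    now = now or datetime.now(timezone.utc)
    last_push = datetime.fromisoformat(info["pushed_at"].replace("Z", "+00:00"))
    return {
        "stars": info["stargazers_count"],
        "forks": info["forks_count"],
        # GitHub includes open pull requests in this number
        "open_issues_and_prs": info["open_issues_count"],
        "days_since_last_push": (now - last_push).days,
        "license": (info.get("license") or {}).get("spdx_id"),
    }


# made-up values in the shape the API returns, so the example runs offline
sample = {
    "stargazers_count": 120,
    "forks_count": 15,
    "open_issues_count": 4,
    "pushed_at": "2024-05-01T12:00:00Z",
    "license": {"spdx_id": "MIT"},
}
print(summarize_repo(sample, now=datetime(2024, 5, 11, 12, 0, tzinfo=timezone.utc)))

# for a real repository: summarize_repo(fetch_repo("caimeng2", "seesus"))
```

Numbers like these are only a first filter. Reading the `README` and the open issues still tells you more about whether a tool fits your project.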

Let's look at an example: [seesus](https://github.com/caimeng2/seesus). It is an open-source package published in [JOSS](https://joss.theoj.org/papers/10.21105/joss.06244). 

&#x1F4D6; **<font color=teal>WHAT WE HAVE LEARNED: </font> Open-source licenses**: Have you noticed the license for this package? What does the license tell you?

&#x270A; **<font color=firebrick>DO THIS: </font>** Skim through the `README` file and see if, and how, we can use this package to analyze the news we've collected.

In [None]:
#!pip install seesus

In [None]:
import pandas as pd
from seesus import SeeSus
from ast import literal_eval
import matplotlib.pyplot as plt

In [None]:
# import the data we cleaned last week
news = pd.read_csv("../W04_data_wrangling/news_clean.csv")

In [None]:
news

In [None]:
# a piece of text for testing
text = news.title.iloc[1]
print(text)

In [None]:
# call the function we imported
result = SeeSus(text)

In [None]:
# print the identified SDGs and their descriptions
print(result.sdg, result.sdg_desc)

Now let's try using the tool to analyze the entire dataset. This is an alternative to conducting content analysis manually.

In [None]:
sdg_title_list = []
sdg_desc_list = []

In [None]:
# warning!! this cell takes about 6 min to run
for i in range(len(news)):
    title = news.title.iloc[i]
    print("analyzing news No.", i) # to see the progress
    desc = news.desc.iloc[i]
    sdg_title = SeeSus(title).sdg
    sdg_desc = SeeSus(desc).sdg
    sdg_title_list.append(sdg_title)
    sdg_desc_list.append(sdg_desc)

In [None]:
news["sdg_title"] = sdg_title_list
news["sdg_desc"] = sdg_desc_list

In [None]:
news

<img src="../image/E_SDG_poster_UN_emblem_WEB.png" width=90% align="center" style="in-line">


Source: https://www.un.org/sustainabledevelopment/news/communications-material/

In [None]:
news.to_csv("news_processed.csv", index=False)

In [None]:
news = pd.read_csv("news_processed.csv", 
                   converters={"sdg_title": literal_eval, "sdg_desc": literal_eval,}) # the last two columns are lists, not strings
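
The converters are needed because a CSV file stores everything as text, so a list such as `['SDG1', 'SDG13']` comes back as a string. Here is a small sketch with made-up data, using an in-memory buffer instead of a file:

```python
from ast import literal_eval
from io import StringIO

import pandas as pd

# toy data standing in for the news table
toy = pd.DataFrame({"title": ["a", "b"], "sdg_title": [["SDG1", "SDG13"], []]})

csv_text = toy.to_csv(index=False)  # the text that would be written to disk

as_text = pd.read_csv(StringIO(csv_text))
as_lists = pd.read_csv(StringIO(csv_text), converters={"sdg_title": literal_eval})

print(type(as_text.sdg_title.iloc[0]))   # <class 'str'>
print(type(as_lists.sdg_title.iloc[0]))  # <class 'list'>
```

`literal_eval` only parses Python literals (lists, strings, numbers, and so on), which makes it a safer choice than `eval` for text read from a file.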

In [None]:
news

In [None]:
news.sdg_title.value_counts()
#news.sdg_desc.value_counts()

In [None]:
# combine sdg_title and sdg_desc columns and keep unique values
news['sdgs_list'] = news[['sdg_title', 'sdg_desc']].apply(lambda x: list(set([item for sublist in x for item in sublist if item])), axis=1)

&#x1F4A1; In the above cell, 
- `item for sublist in x for item in sublist if item` is a list comprehension. It creates a new list by iterating over nested lists. Read more [here](https://realpython.com/list-comprehension-python/) about list comprehension.
- `list(set(...))` removes duplicate items from a list. 
- `apply()` applies a function along one axis of a DataFrame. The default, `axis=0`, passes each column to the function; `axis=1` passes each row, which is what we want here because the two lists for one news item sit in the same row.
- A `lambda` function is a small function without a name, often used for simple operations or as a quick, one-line function without needing a `def` statement. If you'd like to know more about `lambda` functions, check out [this article](https://realpython.com/python-lambda/).
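
To see these pieces at work on something small, here is a sketch with made-up data. The named function `combine_unique` is a stand-in for the `lambda` above and does the same thing.

```python
import pandas as pd

# toy data standing in for the two SDG columns
toy = pd.DataFrame({
    "sdg_title": [["SDG1"], ["SDG13", "SDG7"], []],
    "sdg_desc": [["SDG1", "SDG2"], ["SDG13"], []],
})


def combine_unique(row):
    """Flatten the lists in one row and drop duplicates."""
    flat = [item for sublist in row for item in sublist if item]
    return list(set(flat))


# axis=1 hands each row to the function, one at a time
toy["sdgs_list"] = toy[["sdg_title", "sdg_desc"]].apply(combine_unique, axis=1)
print(toy)
```

Because a `set` has no fixed order, the combined items may come out in a different order from run to run. Using `sorted(set(flat))` instead would give a stable order.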

In [None]:
news.head(7)

In [None]:
# if you'd like to remove []
news['sdgs'] = news['sdgs_list'].apply(lambda x: ', '.join(x) if x else '')

In [None]:
news.head()

In [None]:
news.sdgs.value_counts()

In [None]:
news['date'] = pd.to_datetime(news['date'])
news['month'] = news['date'].dt.to_period('M')  # 'M' period groups by month

Common time series frequencies: https://pandas.pydata.org/docs/user_guide/timeseries.html#timeseries-period-aliases

In [None]:
news.head()

In [None]:
# expand SDG list entries so each SDG has its own row
sdg_expanded = news.explode('sdgs_list')

In [None]:
sdg_expanded

In [None]:
# group by month and SDG, count occurrences
sdg_counts = sdg_expanded.groupby(['month', 'sdgs_list']).size().unstack(fill_value=0)
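
The same reshaping on a few made-up rows may make the steps easier to follow: `explode` gives every list item its own row, `groupby(...).size()` counts the rows per month and SDG, and `unstack` turns the SDGs into columns.

```python
import pandas as pd

# toy data standing in for the news table
toy = pd.DataFrame({
    "date": ["2024-01-05", "2024-01-20", "2024-02-03"],
    "sdgs_list": [["SDG1", "SDG13"], ["SDG13"], []],
})
toy["month"] = pd.to_datetime(toy["date"]).dt.to_period("M")

expanded = toy.explode("sdgs_list")  # an empty list becomes a missing value (NaN)
counts = expanded.groupby(["month", "sdgs_list"]).size().unstack(fill_value=0)
print(counts)
```

Note that news items without any SDG drop out of the counts, because `groupby` skips missing values by default. A month in which no SDG was identified therefore does not appear in the table at all.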

In [None]:
sdg_counts

In [None]:
sdg_info = {
    "SDG1": ("No Poverty", "#E5243B"), 
    "SDG2": ("Zero Hunger", "#DDA63A"), 
    "SDG3": ("Good Health and Well-being", "#4C9F38"), 
    "SDG4": ("Quality Education", "#C5192D"),
    "SDG5": ("Gender Equality", "#FF3A21"), 
    "SDG6": ("Clean Water and Sanitation", "#26BDE2"), 
    "SDG7": ("Affordable and Clean Energy", "#FCC30B"), 
    "SDG8": ("Decent Work and Economic Growth", "#A21942"),
    "SDG9": ("Industry, Innovation and Infrastructure", "#FD6925"), 
    "SDG10": ("Reduced Inequalities", "#DD1367"), 
    "SDG11": ("Sustainable Cities and Communities", "#FD9D24"),
    "SDG12": ("Responsible Consumption and Production", "#BF8B2E"), 
    "SDG13": ("Climate Action", "#3F7E44"), 
    "SDG14": ("Life Below Water", "#0A97D9"), 
    "SDG15": ("Life on Land", "#56C02B"), 
    "SDG16": ("Peace, Justice and Strong Institutions", "#00689D"), 
    "SDG17": ("Partnerships for the Goals", "#19486A")
}

In [None]:
plot_colors = [sdg_info[sdg][1] for sdg in sdg_counts.columns]  # get color from sdg_info
column_labels = [f"{sdg} {sdg_info[sdg][0]}" for sdg in sdg_counts.columns]

In [None]:
column_labels

In [None]:
# Plot the SDG counts
ax = sdg_counts.plot(kind='bar', stacked=True, figsize=(12, 7), color=plot_colors)
plt.title("SDG mentions in the news")
plt.xlabel("Month")
plt.ylabel("Count of SDG mentions")
plt.legend(title="Sustainable Development Goals", labels=column_labels, loc="upper left", bbox_to_anchor=(1, 1))
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

We can also calculate the overall frequency of each SDG across the news dataset.

In [None]:
# sum the occurrences of each SDG across all months
total_sdg_counts = sdg_counts.sum()

In [None]:
# plot a pie chart of overall SDG frequencies
plt.figure(figsize=(10, 8))
plt.pie(total_sdg_counts, labels=column_labels, autopct='%1.0f%%', startangle=90, colors=plot_colors)
plt.title("Overall frequency of each SDG")
plt.tight_layout()
plt.show()

&#x270A; **<font color=firebrick>DO THIS: </font>** Explore GitHub and JOSS. Install and test a tool that you find interesting.

If you'd like some recommendations, check out [`superblockify`](https://github.com/NERDSITU/superblockify) (visualizing and analyzing urban street networks), [`dyntapy`](https://joss.theoj.org/papers/10.21105/joss.04593) (macroscopic traffic modeling), or [`Mobilkit`](https://github.com/mindearth/mobilkit) (analyzing human mobility data).

In [None]:
# put your code here











<a name="3"></a>
## Agenda 3. Signing up for YOU-lead

### Possible topics

They can be **anything** (within the scope of this course) that you find interesting or useful. They can be a use case of what you have learned from the class or a new tool related to what you have learned. The following are some examples.

**Use cases**: scraping traffic data, creating a dashboard using PowerBI, demonstrating how to use GitHub for teamwork, analyzing survey data with R.

**Methods**: machine learning, agent-based modeling, network analysis, traffic simulation, spatial analysis, regular expressions.

**Packages**: pandas, NumPy, scikit-learn, TensorFlow, PyTorch, matplotlib, seaborn, plotly, folium, shapely, geopandas.

**Tools**: VS Code, Google Colab, JupyterHub.

### &#x270A; **<font color=firebrick>DO THIS: </font>**  Sign up [here](https://docs.google.com/spreadsheets/d/1esQtwJurQ6PXKDdeY0r3jsSK4SU1G4MgjCqUpfYVKSY/edit?gid=0#gid=0) now.


---------
### Congratulations, we are done!

This notebook is written by [Meng Cai](https://www.verkehr.tu-darmstadt.de/vv/das_institut_ivv/team_ivv/wissenschaftliche_mitarbeiter_doktoranden/meng_cai/standardseite_204.de.jsp), Technical University of Darmstadt. This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/">Creative Commons Attribution-NonCommercial 4.0 International License</a>.
<a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png" /></a>