# Find the licenses with the most stars in the bigcode/the-stack Hugging Face dataset using Xorbits datasets

In this notebook, we demonstrate how to use Xorbits to find the licenses with the most stars in the bigcode/the-stack dataset.
The Stack contains over 6TB of permissively licensed source code files covering 358 programming languages. The dataset was created
as part of the BigCode Project, an open scientific collaboration working on the responsible development of Large Language Models
for Code (Code LLMs). This notebook focuses on the Python subset.

## Software versions
- Xorbits[datasets]>=0.5.1

In [None]:
# Install dependencies
%pip install "xorbits[datasets]>=0.5.1"

## Load dataset

This step loads the Hugging Face dataset in parallel. First, go to the dataset page https://huggingface.co/datasets/bigcode/the-stack
and submit your email address to request access. Then create an access token in the Access Tokens tab of your Hugging Face settings page.

In [None]:
import xorbits.pandas as pd
import xorbits.datasets as xdatasets
# bigcode/the-stack requires an access token; see https://huggingface.co/datasets/bigcode/the-stack
ds = xdatasets.from_huggingface("bigcode/the-stack", data_dir="data/python", split="train", token="<YOUR ACCESS TOKEN>")
# Use ArrowDtype to reduce memory usage.
pdf = ds.to_dataframe(types_mapper=pd.ArrowDtype)
# Evaluating the dataframe triggers Xorbits execution; the download runs in parallel.
pdf
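
To avoid pasting the token into the notebook itself, one option is to read it from an environment variable. The following is a minimal sketch, not part of the original notebook: the helper name `read_hf_token` is hypothetical, and the variable name `HF_TOKEN` is an assumption made for this example.

```python
import os


def read_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the access token stored in an environment variable.

    Both the helper and the default variable name are assumptions made
    for this sketch; use whatever name you exported the token under.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        # Fail early with a clear message instead of a late download error.
        raise RuntimeError(
            f"Environment variable {env_var} is not set; "
            "create a token in your Hugging Face settings first."
        )
    return token
```

The returned string can then be passed as the `token=` argument of `xdatasets.from_huggingface` in place of the `"<YOUR ACCESS TOKEN>"` placeholder.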

## Process Data

We only need three columns, so we select them first to reduce the memory footprint.

In [None]:
pdf = pdf[["max_stars_repo_name", "max_stars_repo_licenses", "max_stars_count"]]
pdf

Licenses belong to the repository rather than to individual files, so we need to deduplicate the data by `max_stars_repo_name`. As the output above shows,
`max_stars_repo_licenses` is a list of strings, so we convert each value to a single string (the first license in the list) for the subsequent groupby.

In [None]:
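# Keep only the first license listed for each repository.
pdf["max_stars_repo_licenses"] = pdf["max_stars_repo_licenses"].map(lambda x: x[0], dtype="str")
# Keep only the last path segment of the repository name (this drops the owner prefix).
pdf["max_stars_repo_name"] = pdf["max_stars_repo_name"].map(lambda x: x.split("/")[-1], dtype="str")
# Keep the first row for each repository name.
pdf = pdf.drop_duplicates(subset=["max_stars_repo_name"])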
pdf
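
To make the effect of this cell concrete, here is a plain-Python sketch of the same three operations on a few made-up rows. It does not use Xorbits, and the repository names, licenses, and star counts are invented for illustration. The helper name `normalize_and_dedup` is hypothetical.

```python
# Toy rows shaped like the three selected columns (all values are invented).
rows = [
    {"max_stars_repo_name": "alice/tool", "max_stars_repo_licenses": ["MIT"], "max_stars_count": 10},
    {"max_stars_repo_name": "alice/tool", "max_stars_repo_licenses": ["MIT"], "max_stars_count": 10},
    {"max_stars_repo_name": "bob/lib", "max_stars_repo_licenses": ["Apache-2.0", "MIT"], "max_stars_count": 7},
    {"max_stars_repo_name": "carol/tool", "max_stars_repo_licenses": ["BSD-3-Clause"], "max_stars_count": 3},
]


def normalize_and_dedup(rows):
    """Take the first license, keep the last name segment, keep the first row per name."""
    seen = set()
    result = []
    for row in rows:
        name = row["max_stars_repo_name"].split("/")[-1]
        if name in seen:
            # A row with this (shortened) name was already kept.
            continue
        seen.add(name)
        result.append(
            {
                "max_stars_repo_name": name,
                "max_stars_repo_licenses": row["max_stars_repo_licenses"][0],
                "max_stars_count": row["max_stars_count"],
            }
        )
    return result


deduped = normalize_and_dedup(rows)
```

In this toy data, `carol/tool` is dropped along with the duplicate `alice/tool` row, because both shorten to `tool` once the owner prefix is removed. Keep that in mind when interpreting the totals: repositories that share a name under different owners are counted once.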

## Analysis

Finally, we sum the star counts for each license and list the five licenses with the most stars.

In [None]:
result = pdf.groupby("max_stars_repo_licenses")["max_stars_count"].sum()
result.sort_values(ascending=False)[:5]
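
The groupby, sum, and top-five selection above can be sketched in plain Python on toy data, which may help when checking the logic without downloading the dataset. The rows below are invented, the helper name `top_licenses` is hypothetical, and missing star counts are treated as zero on the assumption that this mirrors how a pandas-style `sum` skips missing values by default.

```python
from collections import defaultdict

# Toy deduplicated rows (all values are invented for illustration).
toy_rows = [
    {"max_stars_repo_licenses": "MIT", "max_stars_count": 10},
    {"max_stars_repo_licenses": "Apache-2.0", "max_stars_count": 7},
    {"max_stars_repo_licenses": "MIT", "max_stars_count": 5},
    {"max_stars_repo_licenses": "BSD-3-Clause", "max_stars_count": None},
]


def top_licenses(rows, n=5):
    """Sum star counts per license and return the n largest totals."""
    totals = defaultdict(int)
    for row in rows:
        # Treat a missing count as zero so it does not affect the sum.
        totals[row["max_stars_repo_licenses"]] += row["max_stars_count"] or 0
    # Sort by total stars, largest first, and keep the first n entries.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]


ranking = top_licenses(toy_rows)
```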

## Conclusion

In this notebook, we used Xorbits datasets to load the Python subset of bigcode/the-stack in parallel, kept only the three columns we needed, deduplicated the rows by repository name, and summed the star counts per license to find the top five. The same pattern of loading with `xorbits.datasets`, converting to a dataframe, and applying pandas-style operations can be applied to other large Hugging Face datasets.