# Text Deduplication using Xorbits over OSCAR Corpus
In this notebook, we demonstrate how to use Xorbits to deduplicate text in the OSCAR corpus. OSCAR is a large multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the Ungoliant pipeline. It is available for many languages; we use the Galician (gl) subset in this example.

## Software versions
* xorbits>=0.4.4
* datasets==2.13.1

In [None]:
# Install dependencies
# Quote the requirements so the shell does not treat ">" as an output redirect
%pip install "xorbits>=0.4.4" "datasets==2.13.1"

## Dataset
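[OSCAR Corpus 2201](https://huggingface.co/datasets/oscar-corpus/OSCAR-2201)

First, we download and load the Galician (gl) subset of the OSCAR corpus. Passing `use_auth_token=True` makes `load_dataset` use your locally stored Hugging Face token, so log in beforehand (for example with `huggingface-cli login`):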

In [None]:
from datasets import load_dataset

data = load_dataset(
    "oscar-corpus/OSCAR-2201",
    use_auth_token=True,
    language="gl",
    split="train",
)

## Data loading
Next, we load the data into an Xorbits DataFrame. The DataFrame constructor accepts the data together with a `chunk_size`, which controls how the data is split into chunks for processing:

In [None]:
import xorbits.pandas as pd

df = pd.DataFrame(data.to_pandas(), chunk_size=1000)
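
To give a feel for what a chunk size of 1000 implies, the following pure-Python sketch computes row ranges for blocks of at most `chunk_size` rows. It assumes that an integer `chunk_size` caps the number of rows per chunk; the partitioning Xorbits actually performs may differ. `row_chunk_bounds` is a hypothetical helper written for illustration and is not part of Xorbits.

```python
from typing import List, Tuple


def row_chunk_bounds(n_rows: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Return (start, stop) row ranges for blocks of at most chunk_size rows.

    Illustrative only: this is an assumption about row-wise chunking,
    not Xorbits' actual partitioning logic.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, n_rows))
        for start in range(0, n_rows, chunk_size)
    ]


# A made-up table of 2500 rows split into blocks of 1000 rows
bounds = row_chunk_bounds(2500, 1000)
print(bounds)  # [(0, 1000), (1000, 2000), (2000, 2500)]
```

Under this assumption, a smaller chunk size produces more, smaller pieces of work, while a larger one produces fewer, bigger pieces.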

Once the data is loaded into a DataFrame, we can get a sense of its overall structure by looking at the number of rows and columns, the data type of each column, and the first few rows. We can do this with the `shape` and `dtypes` attributes and the `head()` method, respectively:

In [None]:
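# A notebook cell only displays its last expression, so print the first two
print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # data type of each column
df.head()         # first few rows, shown as the cell output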

## Text Deduplication
We can deduplicate the text with the `dedup()` function from the `xorbits.experimental` module. It takes a DataFrame and the name of the text column, and returns a DataFrame without the rows it identifies as duplicates in that column:

In [None]:
from xorbits.experimental import dedup

res = dedup(df, col="text")
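
To make concrete what removing duplicates on a text column means, here is a minimal pure-Python sketch of exact deduplication by hashing each text and keeping the first row seen for each hash. This is not how Xorbits implements `dedup()`, which may use a different strategy; `dedup_exact` and the sample rows are hypothetical and exist only for illustration.

```python
import hashlib


def dedup_exact(rows, col):
    """Keep the first row for each distinct value of row[col].

    Illustrative sketch of exact-match deduplication, not Xorbits' method.
    """
    seen = set()
    kept = []
    for row in rows:
        # Hash the text so only a fixed-size digest is stored per document
        digest = hashlib.sha256(row[col].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(row)
    return kept


# Made-up sample rows; the first and third share the same text
rows = [
    {"id": 0, "text": "Ola, mundo"},
    {"id": 1, "text": "Bo día"},
    {"id": 2, "text": "Ola, mundo"},
]
print([r["id"] for r in dedup_exact(rows, "text")])  # [0, 1]
```

Exact matching like this misses texts that differ only slightly, such as by whitespace or a changed word, which is one reason dedicated deduplication tools exist.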

## Analysis
Let's see the result and the number of duplicated texts removed:

In [None]:
print(res)
# print() already inserts a space between its arguments
print("Number of duplicated texts removed:", df.shape[0] - res.shape[0])
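
The raw count is easier to interpret as a share of the original rows. The sketch below computes both from the row counts before and after deduplication. `dedup_summary` is a hypothetical helper, and the numbers in the example are made up for illustration; they are not results from the OSCAR corpus.

```python
def dedup_summary(n_before, n_after):
    """Return (rows removed, fraction of the original rows removed)."""
    removed = n_before - n_after
    # Guard against an empty input table
    ratio = removed / n_before if n_before else 0.0
    return removed, ratio


# Illustrative numbers only, not measured on OSCAR
removed, ratio = dedup_summary(1000, 850)
print(f"removed {removed} rows ({ratio:.1%})")  # removed 150 rows (15.0%)
```

In the notebook, the two arguments would be `df.shape[0]` and `res.shape[0]`.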

## Conclusion
In this notebook, we loaded the Galician subset of the OSCAR corpus with the datasets library, converted it to an Xorbits DataFrame, removed duplicated texts with `dedup()`, and counted how many rows were dropped. The same steps apply to other OSCAR languages by changing the `language` argument, and they show how Xorbits can be combined with other Python libraries in a data analysis workflow.