arXiv Kaggle dataset exploration


In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Arxiv-Exploration") \
    .master("local[*]") \
    .config("spark.jars.packages", 
            "org.elasticsearch:elasticsearch-spark-30_2.12:8.8.2") \
    .getOrCreate()

Import the data

In [2]:
# Path to the JSON file (adjust if necessary)
data_path = "../../data/arxiv-metadata-oai-snapshot.json"

# Load the JSON file into a DataFrame, selecting only the fields of interest
df = spark.read.json(data_path) \
        .select("id", "title", "abstract", "authors", "categories")

# Print the schema and number of records to verify
df.printSchema()
print("Total records:", df.count())
df.show(5)

root
 |-- id: string (nullable = true)
 |-- title: string (nullable = true)
 |-- abstract: string (nullable = true)
 |-- authors: string (nullable = true)
 |-- categories: string (nullable = true)

Total records: 2694879
+---------+--------------------+--------------------+--------------------+---------------+
|       id|               title|            abstract|             authors|     categories|
+---------+--------------------+--------------------+--------------------+---------------+
|0704.0001|Calculation of pr...|  A fully differe...|C. Bal\'azs, E. L...|         hep-ph|
|0704.0002|Sparsity-certifyi...|  We describe a n...|Ileana Streinu an...|  math.CO cs.CG|
|0704.0003|The evolution of ...|  The evolution o...|         Hongjun Pan| physics.gen-ph|
|0704.0004|A determinant of ...|  We show that a ...|        David Callan|        math.CO|
|0704.0005|From dyadic $\Lam...|  In this paper w...|Wael Abu-Shammala...|math.CA math.FA|
+---------+--------------------+-------------------
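Since `spark.read.json` reads line-delimited JSON by default (one object per line), a few raw records can also be inspected without starting Spark. The helper below is a minimal sketch of that idea; `peek_records` and its parameters are hypothetical names introduced here, not part of the notebook.

```python
import json
from itertools import islice


def peek_records(lines, n=3, fields=("id", "title", "categories")):
    """Parse the first n non-empty JSON lines, keeping only the given fields."""
    non_empty = (line for line in lines if line.strip())
    records = []
    for line in islice(non_empty, n):
        record = json.loads(line)
        # Missing keys become None, similar to null values in a nullable Spark column
        records.append({key: record.get(key) for key in fields})
    return records
```

Passing an open file handle, for example `with open(data_path, encoding="utf-8") as fh: peek_records(fh)`, reads only the first few lines rather than the whole snapshot.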

Now let's clean and normalize the fields:

- Remove newline characters and excess whitespace in titles and abstracts.
- (Optionally) convert text to lowercase for consistent processing.
- Clean the authors field (e.g., remove line breaks, unify separators).
- Split the categories field into an array of individual category codes, since categories in the raw data are a single string of space-separated codes.

In [6]:
from pyspark.sql.functions import regexp_replace, trim, lower, split

# Collapse newlines and runs of whitespace into single spaces, then trim the ends
df_clean = df.withColumn("title", trim(regexp_replace("title", r"\s+", " "))) \
             .withColumn("abstract", trim(regexp_replace("abstract", r"\s+", " ")))

# Optionally, make text lowercase (for consistent analysis, though Elasticsearch will handle casing)
df_clean = df_clean.withColumn("title", lower("title")) \
                   .withColumn("abstract", lower("abstract")) \
                   .withColumn("authors", lower("authors"))

# Normalize authors: replace " and " (with an optional preceding comma) by ", ",
# then collapse any remaining runs of whitespace
df_clean = df_clean.withColumn("authors", regexp_replace("authors", r",?\s+and\s+", ", ")) \
                   .withColumn("authors", regexp_replace("authors", r"\s+", " "))

# Split categories into array of category codes
df_clean = df_clean.withColumn("categories", split("categories", " "))


df_clean.printSchema()
print("Total records:", df_clean.count())
df_clean.show(5)

root
 |-- id: string (nullable = true)
 |-- title: string (nullable = true)
 |-- abstract: string (nullable = true)
 |-- authors: string (nullable = true)
 |-- categories: array (nullable = true)
 |    |-- element: string (containsNull = false)

Total records: 2694879
+---------+--------------------+--------------------+--------------------+------------------+
|       id|               title|            abstract|             authors|        categories|
+---------+--------------------+--------------------+--------------------+------------------+
|0704.0001|calculation of pr...|a fully different...|c. bal\'azs, e. l...|          [hep-ph]|
|0704.0002|sparsity-certifyi...|we describe a new...|ileana streinu, l...|  [math.CO, cs.CG]|
|0704.0003|the evolution of ...|the evolution of ...|         hongjun pan|  [physics.gen-ph]|
|0704.0004|a determinant of ...|we show that a de...|        david callan|         [math.CO]|
|0704.0005|from dyadic $\lam...|in this paper we ...|wael abu-shammala...

The above transformations:
- Use regexp_replace to replace newlines (\r, \n) with spaces and collapse runs of whitespace into a single space.
- Lowercase the text in title, abstract, and authors.
- In authors, replace the word " and " with a comma and a space, so that "Smith and Doe" becomes "smith, doe" after lowercasing, and collapse repeated whitespace. Authors are then separated uniformly by commas.
- Split the categories string on spaces into an array (e.g., "cs.AI cs.CL" becomes ["cs.AI", "cs.CL"]).
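To check the regular expressions on a single record without running a Spark job, the same steps can be sketched in plain Python. This is an approximation under stated assumptions: the function names are hypothetical, and Python's `re` module can differ from Spark's Java regex engine on edge cases such as Unicode whitespace. The author pattern here also absorbs an optional comma before "and", so a list like "a, b, and c" does not end up with a doubled comma.

```python
import re


def clean_text(text):
    """Collapse newlines and runs of whitespace to single spaces, trim, lowercase."""
    return re.sub(r"\s+", " ", text).strip().lower()


def clean_authors(authors):
    """Lowercase the author string and turn ' and ' separators into ', '."""
    authors = clean_text(authors)
    # The optional comma avoids "a, b,, c" for inputs like "a, b, and c"
    return re.sub(r",?\s+and\s+", ", ", authors)


def split_categories(categories):
    """Split a space-separated category string into a list of codes."""
    return categories.split(" ")


print(clean_text("  A fully\ndifferential   calculation "))  # a fully differential calculation
print(clean_authors("Smith and Doe"))                        # smith, doe
print(split_categories("cs.AI cs.CL"))                       # ['cs.AI', 'cs.CL']
```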

## Text Processing and Feature Extraction

#### Tokenization and Stopword Removal

We will tokenize the abstract (and optionally title) text into words, and then remove common stop words (like "the", "and", "of", etc.) which are not useful in search queries. PySpark's ML library provides Tokenizer/RegexTokenizer and StopWordsRemover for this purpose. Let's tokenize the abstracts into words:

In [7]:
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover

# Tokenizer to split text on non-word characters (this will break text into words)
tokenizer = RegexTokenizer(inputCol="abstract", outputCol="words", pattern="\\W")
words_data = tokenizer.transform(df_clean)

# Remove stop words
remover = StopWordsRemover(inputCol="words", outputCol="filtered_words")
filtered_data = remover.transform(words_data)

filtered_data.select("abstract", "filtered_words").show(3, truncate=80)


+--------------------------------------------------------------------------------+--------------------------------------------------------------------------------+
|                                                                        abstract|                                                                  filtered_words|
+--------------------------------------------------------------------------------+--------------------------------------------------------------------------------+
|a fully differential calculation in perturbative quantum chromodynamics is pr...|[fully, differential, calculation, perturbative, quantum, chromodynamics, pre...|
|we describe a new algorithm, the $(k,\ell)$-pebble game with colors, and use ...|[describe, new, algorithm, k, ell, pebble, game, colors, use, obtain, charact...|
|the evolution of earth-moon system is described by the dark matter field flui...|[evolution, earth, moon, system, described, dark, matter, field, fluid, model...|
+---------------
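What the two stages do to one abstract can be approximated in plain Python. The sketch below assumes the default `RegexTokenizer` behaviour (lowercasing, splitting on the pattern, dropping empty tokens) and uses a tiny hand-picked stop word set purely for illustration; Spark's default English list is much longer. The function names are hypothetical.

```python
import re

# Tiny illustrative stop word set; not Spark's actual default list
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "we", "with"}


def tokenize(text):
    """Lowercase, split on single non-word characters, and drop empty tokens."""
    # re.ASCII restricts \W to [^a-zA-Z0-9_], matching Java's default \W
    return [tok for tok in re.split(r"\W", text.lower(), flags=re.ASCII) if tok]


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop word set."""
    return [tok for tok in tokens if tok not in stop_words]


tokens = tokenize(r"We describe a new algorithm, the $(k,\ell)$-pebble game with colors")
print(remove_stop_words(tokens))
# ['describe', 'new', 'algorithm', 'k', 'ell', 'pebble', 'game', 'colors']
```

Note how the LaTeX fragment `$(k,\ell)$` is broken into the tokens `k` and `ell`, which is the same effect visible in the second row of the output above.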

#### Computing TF-IDF Features
Next, we compute TF-IDF (Term Frequency–Inverse Document Frequency) vectors for the documents. TF-IDF is a numerical statistic that reflects how important a word is to a document in a corpus.

We will use Spark's HashingTF to hash words into term frequency vectors, then IDF to scale them:

In [8]:
from pyspark.ml.feature import HashingTF, IDF

# HashingTF to map the filtered words to a fixed-length feature vector
hashingTF = HashingTF(inputCol="filtered_words", outputCol="rawFeatures", numFeatures=10000)
featurized_data = hashingTF.transform(filtered_data)
# Compute the IDF model on the corpus
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurized_data)
rescaled_data = idfModel.transform(featurized_data)

# Check the TF-IDF feature vector for a sample
rescaled_data.select("id", "features").show(3, truncate=80)


+---------+--------------------------------------------------------------------------------+
|       id|                                                                        features|
+---------+--------------------------------------------------------------------------------+
|0704.0001|(10000,[134,157,282,436,717,944,1072,1080,1113,1161,1226,1253,1439,1481,1695,...|
|0704.0002|(10000,[221,274,310,521,585,625,705,870,885,1055,1296,1468,1541,2093,2241,225...|
|0704.0003|(10000,[20,157,253,258,316,327,399,617,697,735,922,1016,1038,1622,1695,1896,1...|
+---------+--------------------------------------------------------------------------------+
only showing top 3 rows
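The sparse vectors above pair bucket indices with weights. A plain-Python sketch of the two steps may make that concrete: terms are hashed into a fixed number of buckets and counted, then each count is scaled by the smoothed inverse document frequency log((m + 1) / (df + 1)), the form described in Spark's MLlib documentation, where m is the number of documents and df the number of documents containing the bucket. The assumptions here are that CRC32 stands in for Spark's MurmurHash3 (so bucket indices will not match Spark's), that the bucket count is deliberately tiny, and that all function names are hypothetical.

```python
import math
import zlib
from collections import Counter

NUM_FEATURES = 16  # deliberately tiny; the notebook uses 10000


def bucket(term, num_features=NUM_FEATURES):
    """Map a term to a bucket index (CRC32 is a stand-in for Spark's MurmurHash3)."""
    return zlib.crc32(term.encode("utf-8")) % num_features


def hashing_tf(tokens, num_features=NUM_FEATURES):
    """Return a sparse {bucket index: raw term count} mapping for one document."""
    return dict(Counter(bucket(tok, num_features) for tok in tokens))


def fit_idf(tf_vectors):
    """Compute a smoothed IDF weight per bucket: log((m + 1) / (df + 1))."""
    m = len(tf_vectors)
    doc_freq = Counter(idx for vec in tf_vectors for idx in vec)
    return {idx: math.log((m + 1) / (df + 1)) for idx, df in doc_freq.items()}


def tf_idf(tf_vector, idf):
    """Scale each raw term count by the IDF weight of its bucket."""
    return {idx: count * idf[idx] for idx, count in tf_vector.items()}


docs = [
    ["dark", "matter", "field", "fluid", "model"],
    ["pebble", "game", "colors"],
    ["dark", "matter", "evolution"],
]
tf_vectors = [hashing_tf(doc) for doc in docs]
idf = fit_idf(tf_vectors)
weighted = [tf_idf(vec, idf) for vec in tf_vectors]
```

Distinct terms that hash to the same bucket are counted together, so a bucket that appears in every document gets a weight of zero regardless of which terms landed in it. With a large vocabulary and only 10000 buckets, such collisions are expected; a larger `numFeatures` reduces them at the cost of wider vectors.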

