# Tutorial: Loading and Visualizing Documents from the Open4Business (O4B) Dataset

The extracted directory contains 7 files: one source file and one target file for each of the three splits (train, dev and test), plus one additional file. For instance, the training set files are named train.source and train.target, and the dev (validation) files used below are named val.source and val.target. The additional file, refs.bib, contains the BibTeX references for the articles used to create O4B.

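In both the source and target files, each line represents one record, so a given line of a source file holds the full text whose summary is on the same line of the matching target file.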
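
Because the two files are line-aligned, they can also be read with plain Python, without a document loader. The sketch below is only an illustration under stated assumptions: `read_records` is a hypothetical helper, UTF-8 encoding is assumed, and the two tiny files it reads are throwaway stand-ins written to a temporary directory, not the real O4B data.

```python
import tempfile
from pathlib import Path


def read_records(path):
    """Return one string per line of the file (one record per line)."""
    # UTF-8 is an assumption; adjust if the files use another encoding.
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


# Stand-in files for illustration only; point these paths at the real
# *.source / *.target files to read the dataset itself.
tmp_dir = Path(tempfile.mkdtemp())
(tmp_dir / "demo.source").write_text(
    "full text of article A\nfull text of article B\n", encoding="utf-8"
)
(tmp_dir / "demo.target").write_text(
    "summary of A\nsummary of B\n", encoding="utf-8"
)

articles = read_records(tmp_dir / "demo.source")
summaries = read_records(tmp_dir / "demo.target")

# Line i of the source file pairs with line i of the target file,
# so both lists must have the same length.
assert len(articles) == len(summaries)
pairs = list(zip(articles, summaries))
print(pairs[0])
```

Keeping the records as two parallel lists mirrors the file layout; zipping them is enough to iterate over (article, summary) pairs.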

In [1]:
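# In newer LangChain releases this loader is imported from
# langchain_community.document_loaders instead.
from langchain.document_loaders import UnstructuredFileLoader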

In [2]:
!pwd

/Users/saucabadal/SiaTests/Dataset-v1/tutorials


The dataset comprises 13,966, 1,746 and 1,746 full-text article/summary pairs for the train, validation and test splits, respectively.
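
As a quick sanity check on these figures, the share of each split can be computed directly. The snippet below is a small arithmetic sketch that uses only the three sizes stated above; the variable names are illustrative.

```python
# Split sizes as stated in the text above.
split_sizes = {"train": 13966, "validation": 1746, "test": 1746}

total = sum(split_sizes.values())
shares = {name: size / total for name, size in split_sizes.items()}

print(f"total pairs: {total}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The sizes add up to 17,458 pairs, which corresponds to roughly an 80/10/10 split.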

### Test split

In [10]:
source_loader_test = UnstructuredFileLoader("../data/open4business/Open4Business/test.source", mode="elements")
# Unstructured creates different "elements" for different chunks of text.
# By default the loader combines them into a single document, but you can
# keep them separate by specifying mode="elements".

target_loader_test = UnstructuredFileLoader("../data/open4business/Open4Business/test.target", mode="elements")

docs_test = source_loader_test.load()
# As each line represents one record, specifying mode="elements" makes each chunk of text one full-text article.
# docs_test is a list containing one article at each position.

summaries_test = target_loader_test.load()

In [18]:
print(f'(Test set) - Total number of full text articles: {len(docs_test)}')
print(f'(Test set) - Total number of summaries: {len(summaries_test)}')

(Test set) - Total number of full text articles: 1746
(Test set) - Total number of summaries: 1746


In [14]:
print(docs_test[0].page_content) # index 0 returns the first article, index 1 the second article, etc.

With the sustained development of economy and the urbanization, the differences in economy and society between urban and rural areas and among regions around the nation have caught the public eyes, and how to balance the relationship between economy and population has become an inevitable issue. As the geographical conditions, cultural foundation and technological level of each region are different, each country faces a certain degree of regional disparity, and this disparity appears particularly conspicuous in a vast country like China. Despite the fact that many policies like the western development, the revitalization of the northeast old industrial base and the rise of the central have been made by How to cite this report: Zhu, J. Study on the Balance of Economy and Po-the government to narrow the regional gaps, there is no denying that the current state of population and economy among provinces is still very uneven. Generally speaking, the western region is backward in economy and

In [15]:
print(summaries_test[0].page_content)

Study on the Balance of Economy and Population in China during 2000-2015. From the perspective of long-term equilibrium, the proportion of a region's economy and population in the country should roughly be equal. In this report, the quotient of GDP proportion and population proportion is defined as R, whose value and characteristics of volatility accurately reflects the feature of distribution and equilibrium between economy and population. By using GIS visualization technology, this report finds that the economic and demographic distribution in Chins is still far from matching currently, with a trend of polarization between east and west. However, from 2000 to 2015, the matching degree of economy and population at the national level is actually on the rise. This report then divides apart the economic factor and demographic factor that cause the R value to change, and comes to a conclusion that the status between the economy and the population in most provinces is affected by economic 
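
Full articles are long, so printing them whole quickly floods the notebook output. One option is a truncated preview of an article next to its summary. The sketch below is a minimal illustration: `preview_pair` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that mimic only the `page_content` attribute used above, so the block runs without the dataset.

```python
import textwrap
from types import SimpleNamespace


def preview_pair(docs, summaries, index, width=80):
    """Return shortened (article, summary) text for the record at `index`."""
    if len(docs) != len(summaries):
        raise ValueError("source and target lists must have the same length")
    # textwrap.shorten collapses whitespace and cuts at a word boundary.
    article = textwrap.shorten(
        docs[index].page_content, width=width, placeholder=" ..."
    )
    summary = textwrap.shorten(
        summaries[index].page_content, width=width, placeholder=" ..."
    )
    return article, summary


# Stand-ins for illustration; with the data loaded above you would pass
# docs_test and summaries_test instead.
demo_docs = [SimpleNamespace(page_content="word " * 100)]
demo_summaries = [SimpleNamespace(page_content="A short summary.")]

article, summary = preview_pair(demo_docs, demo_summaries, index=0, width=40)
print(article)
print(summary)
```

The length check guards against pairing an article with the wrong summary if the two lists ever fall out of alignment.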

### Validation split

In [21]:
source_loader_val = UnstructuredFileLoader("../data/open4business/Open4Business/val.source", mode="elements")
target_loader_val = UnstructuredFileLoader("../data/open4business/Open4Business/val.target", mode="elements")

docs_val = source_loader_val.load()
summaries_val = target_loader_val.load()

print(f'(Validation set) - Total number of full text articles: {len(docs_val)}')
print(f'(Validation set) - Total number of summaries: {len(summaries_val)}')

(Validation set) - Total number of full text articles: 1746
(Validation set) - Total number of summaries: 1746


### Train split

In [23]:
source_loader_train = UnstructuredFileLoader("../data/open4business/Open4Business/train.source", mode="elements")
target_loader_train = UnstructuredFileLoader("../data/open4business/Open4Business/train.target", mode="elements")

docs_train = source_loader_train.load()
summaries_train = target_loader_train.load()

print(f'(Train set) - Total number of full text articles: {len(docs_train)}')
print(f'(Train set) - Total number of summaries: {len(summaries_train)}')

(Train set) - Total number of full text articles: 13966
(Train set) - Total number of summaries: 13966
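
The three splits are loaded with the same two steps, so the repetition can be folded into a loop. The sketch below is one possible refactoring, not part of the original notebook: `load_all_splits` and `read_lines` are hypothetical helpers, the loader is passed in as a function, and the demo runs on throwaway files in a temporary directory rather than on the real data.

```python
import tempfile
from pathlib import Path


def load_all_splits(data_dir, load_fn, splits=("train", "val", "test")):
    """Load <split>.source and <split>.target for every split with `load_fn`."""
    data_dir = Path(data_dir)
    loaded = {}
    for split in splits:
        docs = load_fn(str(data_dir / f"{split}.source"))
        summaries = load_fn(str(data_dir / f"{split}.target"))
        if len(docs) != len(summaries):
            raise ValueError(
                f"{split}: {len(docs)} articles but {len(summaries)} summaries"
            )
        loaded[split] = (docs, summaries)
    return loaded


def read_lines(path):
    """Plain-Python stand-in loader: one record per line (UTF-8 assumed)."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


# Demo on throwaway files. With the loader used in this tutorial, load_fn
# would be: lambda path: UnstructuredFileLoader(path, mode="elements").load()
demo_dir = Path(tempfile.mkdtemp())
for split, n_records in {"train": 3, "val": 1, "test": 1}.items():
    (demo_dir / f"{split}.source").write_text("article\n" * n_records, encoding="utf-8")
    (demo_dir / f"{split}.target").write_text("summary\n" * n_records, encoding="utf-8")

data = load_all_splits(demo_dir, read_lines)
for split, (docs, summaries) in data.items():
    print(f"({split}) articles: {len(docs)}, summaries: {len(summaries)}")
```

Passing the loader as an argument keeps the loop independent of any particular library, and the length check fails early if a split's source and target files are out of step.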
