# Import Data

## Article

Data from "FLORES-200"

Occasion: Article (News, report)

Data bibliography

@article{nllb-22,
  title = {No Language Left Behind: Scaling Human-Centered Machine Translation},
  author = {{NLLB Team} and Costa-jussà, Marta R. and Cross, James and Çelebi, Onur and Elbayad, Maha and Heafield, Kenneth and Heffernan, Kevin and Kalbassi, Elahe and Lam, Janice and Licht, Daniel and Maillard, Jean and Sun, Anna and Wang, Skyler and Wenzek, Guillaume and Youngblood, Al and Akula, Bapi and Barrault, Loic and Mejia-Gonzalez, Gabriel and Hansanti, Prangthip and Hoffman, John and Jarrett, Semarley and Sadagopan, Kaushik Ram and Rowe, Dirk and Spruit, Shannon and Tran, Chau and Andrews, Pierre and Ayan, Necip Fazil and Bhosale, Shruti and Edunov, Sergey and Fan, Angela and Gao, Cynthia and Goswami, Vedanuj and Guzmán, Francisco and Koehn, Philipp and Mourachko, Alexandre and Ropers, Christophe and Saleem, Safiyyah and Schwenk, Holger and Wang, Jeff},
  year = {2022},
  eprint = {arXiv:2207.04672},
}

In [29]:
import requests
import pandas as pd

In [104]:
def get_urls(owner, repo, path):
    """List the files in a GitHub repository folder and build their raw-content URLs."""
    # Construct the URL for the GitHub contents API
    api_url = f"https://api.github.com/repos/{owner}/{repo}/contents/{path}"

    # Fetch the folder listing; the timeout keeps the call from hanging indefinitely
    response = requests.get(api_url, timeout=30)

    # Check if the request was successful
    if response.status_code == 200:
        # Keep only regular files; a folder listing can also contain subdirectories
        names = [item['name'] for item in response.json() if item['type'] == 'file']
        # Build the raw-content URL of each file (assumes the default branch is 'main')
        urls = [f"https://raw.githubusercontent.com/{owner}/{repo}/main/{path}/{name}" for name in names]

        # Print the list of files and their corresponding URLs
        print("Files in the GitHub repository folder:")
        for name, url in zip(names, urls):
            print(f"{name}: {url}")

        return names, urls
    else:
        # If the request was not successful, print an error message
        print(f"Error: Failed to fetch content from GitHub repository. Status code: {response.status_code}")
        return [], []
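
`get_urls` builds each raw URL by hand and hard-codes the `main` branch. As a possible alternative, each entry in a GitHub contents API folder listing also carries `type` and `download_url` fields, so the URLs can be read directly from the response. The sketch below illustrates this offline. The helper name `urls_from_listing` and the sample listing are hypothetical and only mimic the shape of the API response.

```python
def urls_from_listing(entries):
    """Return (names, urls) for the regular files in a contents-API listing."""
    names, urls = [], []
    for entry in entries:
        # Skip anything that is not a regular file (for example subdirectories)
        if entry.get("type") != "file":
            continue
        names.append(entry["name"])
        urls.append(entry["download_url"])
    return names, urls


# Hypothetical listing, shaped like the JSON returned for a folder
sample_listing = [
    {
        "name": "eng_Latn.dev",
        "type": "file",
        "download_url": "https://raw.githubusercontent.com/rnagamatsu/textanalytics_group2/main/flores200_group2/eng_Latn.dev",
    },
    {"name": "archive", "type": "dir", "download_url": None},
]

sample_names, sample_urls = urls_from_listing(sample_listing)
print(sample_names)
```

In a live call, `response.json()` from `get_urls` could be passed to this helper in place of `sample_listing`.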

In [105]:
# GitHub repository information
owner = "rnagamatsu"
repo = "textanalytics_group2"
path = "flores200_group2"

# Get the list of files in the GitHub repository folder
files_names, files_urls = get_urls(owner, repo, path)

Files in the GitHub repository folder:
eng_Latn.dev: https://raw.githubusercontent.com/rnagamatsu/textanalytics_group2/main/flores200_group2/eng_Latn.dev
hin_Deva.dev: https://raw.githubusercontent.com/rnagamatsu/textanalytics_group2/main/flores200_group2/hin_Deva.dev
ind_Latn.dev: https://raw.githubusercontent.com/rnagamatsu/textanalytics_group2/main/flores200_group2/ind_Latn.dev
jpn_Jpan.dev: https://raw.githubusercontent.com/rnagamatsu/textanalytics_group2/main/flores200_group2/jpn_Jpan.dev
kor_Hang.dev: https://raw.githubusercontent.com/rnagamatsu/textanalytics_group2/main/flores200_group2/kor_Hang.dev
spa_Latn.dev: https://raw.githubusercontent.com/rnagamatsu/textanalytics_group2/main/flores200_group2/spa_Latn.dev
tha_Thai.dev: https://raw.githubusercontent.com/rnagamatsu/textanalytics_group2/main/flores200_group2/tha_Thai.dev
vie_Latn.dev: https://raw.githubusercontent.com/rnagamatsu/textanalytics_group2/main/flores200_group2/vie_Latn.dev


In [106]:
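def process_flores(files_names, files_urls):
    """Download each file and return a DataFrame with one column per language."""
    dfs = []

    for index, (name, url) in enumerate(zip(files_names, files_urls), start=1):
        response = requests.get(url, timeout=30)
        if response.status_code == 200:
            # The first three characters of the file name are the language code (e.g. 'eng')
            df_name = name[:3]
            # One sentence per line; drop trailing newlines first to avoid an empty final row
            lines = response.text.rstrip('\n').split('\n')
            dfs.append(pd.DataFrame(lines, columns=[df_name]))
            print(f"Downloaded and processed file {index}/{len(files_names)}: {name}")
        else:
            print(f"Error: Failed to download file {name}. Status code: {response.status_code}")

    # Concatenate the DataFrames along the columns axis
    if dfs:
        concatenated_df = pd.concat(dfs, axis=1)
        return concatenated_df
    else:
        print("No DataFrames to concatenate.")
        return None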
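
The column-wise concatenation only yields a parallel corpus if every file contributes the same number of lines. The following minimal sketch shows that step on in-memory strings, with an explicit check on line counts. The helper `build_parallel_frame` and the two sample texts are hypothetical and are not part of the notebook. It strips trailing newlines before splitting on `'\n'`, because `str.splitlines()` also breaks on other Unicode line boundaries that may occur inside a sentence.

```python
import pandas as pd


def build_parallel_frame(raw_by_lang):
    """Build one column per language from newline-separated text."""
    columns = {}
    for lang, raw in raw_by_lang.items():
        # Remove trailing newlines so the last row is not an empty string
        columns[lang] = raw.rstrip("\n").split("\n")

    # Every language must contribute the same number of sentences
    lengths = {lang: len(lines) for lang, lines in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Line counts differ across languages: {lengths}")

    return pd.DataFrame(columns)


# Made-up sample: the first text ends with a newline, the second does not
sample = {
    "eng": "Hello.\nGood morning.\n",
    "spa": "Hola.\nBuenos días.",
}
frame = build_parallel_frame(sample)
print(frame.shape)
```

Raising an error on mismatched line counts is one design choice; `pd.concat(..., axis=1)` would instead pad the shorter columns with `NaN`.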

In [107]:
texts = process_flores(files_names, files_urls)

Downloaded and processed file 1/8: eng_Latn.dev
Downloaded and processed file 2/8: hin_Deva.dev
Downloaded and processed file 3/8: ind_Latn.dev
Downloaded and processed file 4/8: jpn_Jpan.dev
Downloaded and processed file 5/8: kor_Hang.dev
Downloaded and processed file 6/8: spa_Latn.dev
Downloaded and processed file 7/8: tha_Thai.dev
Downloaded and processed file 8/8: vie_Latn.dev


In [112]:
texts

| | eng | hin | ind | jpn | kor | spa | tha | vie |
|---|---|---|---|---|---|---|---|---|
| 0 | On Monday, scientists from the Stanford Univer... | सोमवार को, स्टैनफ़ोर्ड यूनिवर्सिटी स्कूल ऑफ़ म... | Ilmuwan dari Stanford University School of Med... | 月曜日にスタンフォード大学医学部の科学者たちは、細胞を種類別に分類できる新しい診断ツールを発... | 스탠포드 의과대학 연구진은 지난 월요일 세포를 유형별로 분류할 수 있는 새로운 진단... | El lunes, los científicos de la facultad de me... | เมื่อวันจันทร์ที่ผ่านมา นักวิทยาศาสตร์จากโรงเร... | Vào hôm thứ Hai, các nhà khoa học thuộc Trường... |
| 1 | Lead researchers say this may bring early dete... | शोधकर्ताओं ने कहा है कि यह अल्प आय वाले देशों ... | Ketua peneliti mengatakan bahwa diagnosis ini ... | 主任研究者は、これは低所得国における患者の癌、結核、HIV、マラリアの早期発見につながる可能... | 수석 연구진들은 이것이 선진국 대비 절반의 생존율을 보이는 저소득 국가들의 환자들에... | Los principales investigadores principales sos... | นักวิจัยชั้นนำกล่าวว่าสิ่งนี้อาจทำให้มีการตรวจ... | Các nhà nghiên cứu chính nói rằng điều này có ... |
| 2 | The JAS 39C Gripen crashed onto a runway at ar... | स्थानीय समय (0230 UTC) के मुताबिक करीब 9:30 बज... | JAS 39C Gripen jatuh ke landasan pacu sekitar ... | JAS 39Cグリペンは現地時間の午前9時30分頃（UTC 0230）に滑走路に墜落して爆発... | 현지 시간으로 약 아침 9시 30분(0230 UTC)에 JAS 39C 그리펜이 활주... | El JAS 39C Gripen impactó contra una pista cer... | JAS 39C Gripen ตกลงบนรันเวย์เมื่อเวลาประมาณ 09... | Chiếc JAS 39C Gripen đâm xuống đường băng vào ... |
| 3 | The pilot was identified as Squadron Leader Di... | पायलट की पहचान स्क्वाड्रन लीडर दिलोकृत पटावी क... | Pilot tersebut diidentifikasi sebagai Pemimpin... | 操縦士は中隊長のディロクリット・パタヴェー氏であることが確認されました。 | 그 조종사는 비행 중대장 딜로크리트 패타비로 확인되었다. | Se identificó al piloto como Dilokrit Pattavee... | นักบินคนดังกล่าวถูกระบุว่าเป็นนาวาอากาศตรีดิลก... | Viên phi công được xác định là Chỉ huy đội bay... |
| 4 | Local media reports an airport fire vehicle ro... | स्थानीय मीडिया ने बताया है कि कार्रवाई करने के... | Media lokal melaporkan sebuah kendaraan pemada... | 地元メディアの報道によると、空港の消防車が対応中に横転したということです。 | 현지 언론은 공항 소방차가 사고에 대응하는 도중에 전복되었다고 보도했습니다. | La prensa local informó que una patrulla de bo... | ผู้สื่อข่าวท้องถิ่นรายงานว่ารถดับเพลิงประจำสนา... | Truyền thông địa phương đưa tin một phương tiệ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 993 | However, they have a different kind of beauty ... | हालांकि, सर्दियों के दौरान उनका एक अलग तरह का ... | Namun, jenis keindahan dan pesonanya berbeda s... | しかし、多くのヒルステーションには十分な積雪があり、スキーやスノーボードなどのアクティビティ... | 그러나 겨울에는 눈이 수북히 쌓여 스키나 스노보드 같은 겨울 액티비티도 즐길 수 있... | De todas formas, su belleza y encanto en época... | อย่างไรก็ตาม สถานที่เหล่านี้มีความงามและเสน่ห์... | Tuy nhiên, vào mùa đông, chúng mang một vẻ đẹp... |
| 994 | Only a few airlines still offer bereavement fa... | केवल कुछ ही एयरलाइंस शोक किराए की पेशकश करती ह... | Hanya sedikit maskapai yang masih memberikan t... | 死別運賃をまだ提供している航空会社は数社のみですが、この運賃で葬儀直前の旅費がわずかに割り引... | 소수의 항공사만이 여전히 장례 유족 운임을 제공한다. 이는 장례를 치르기 위해 긴급... | Las aerolíneas que aún ofrecen tarifas por due... | มีเพียงไม่กี่สายการบินเท่านั้นที่ยังคงเสนอค่าโ... | Chỉ một vài hãng hàng không cung cấp dịch vụ g... |
| 995 | Airlines that offer these include Air Canada, ... | ये ऑफर देने वाली एयरलाइन में यूएस या कनाडा और ... | Maskapai-maskapai yang menawarkan tersebut ter... | 該当する航空会社は、エア・カナダ、デルタ航空、ルフトハンザ、ウェストジェットなどです。 | 미국이나 캐나다에서 출발하는 항공편에 이것을 제공하는 항공사는 에어 캐나다, 델타 ... | Las compañías aéreas que los ofrecen son Air C... | สายการบินที่ให้บริการเหล่านี้ ได้แก่ แอร์แคนาด... | Các hãng hàng không cung cấp các dịch vụ này b... |
| 996 | In all cases, you must book by phone directly ... | सभी मामलों में, आपको एयरलाइन की बुकिंग सीधे फ़... | Dalam semua kasus, Anda harus memesan melalui ... | いずれの場合も、航空会社に直接電話で予約する必要があります。 | 어떤 경우에도, 전화로 항공사에 직접 예약해야 합니다. | En todos los casos, usted tiene que hacer la r... | คุณต้องจองทางโทรศัพท์โดยตรงกับสายการบินในทุกกรณี | Trong mọi trường hợp, bạn phải gọi điện đặt tr... |
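
Before analysing the combined table, it can help to confirm that no language column contains missing or blank sentences, since either would indicate a misaligned or partially downloaded file. The sketch below is one way to run that check. The helper `count_blank_sentences` and the small demo frame are hypothetical; the same call could be applied to `texts`.

```python
import pandas as pd


def count_blank_sentences(df):
    """Return, per column, how many cells are missing or whitespace-only."""
    # Treat missing values as empty strings, then flag cells that are blank after stripping
    blank = df.apply(lambda col: col.fillna("").astype(str).str.strip().eq(""))
    return blank.sum()


# Made-up demo frame with one blank cell and one missing cell
demo = pd.DataFrame({
    "eng": ["First sentence.", "Second sentence.", ""],
    "jpn": ["最初の文。", "二番目の文。", None],
})
report = count_blank_sentences(demo)
print(report)
```

A column with a non-zero count would be worth inspecting before the sentences are compared across languages.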
