
## USFM
[USFM Documentation](https://ubsicap.github.io/usfm/)
## Input file
[Input Repo](https://git.door43.org/Door43-Catalog/hi_tn)
## Expected output
```
\id TIT                                     # Book ID. Must be the first line of the output; only one \id is permitted.
\c 1                                        # Chapter No. Occurs once for each chapter
\p                                          # Para placeholder
\v 1                                        # Verse No
\p                                          # Para placeholder
\tr                                         # Row Begin
\tc1 Son of God                             # Column: GLQuote
\tc2 Υἱοῦ Θεοῦ                              # Column: OrigQuote
\tc3 guidelines-sonofgodprinciples          # Column: SupportReference
\tr                                         # Row Begin
\tc1-3 यह यीशु के लिए एक महत्वपूर्ण पदवी है।           # Merged-Column: OccurrenceNote
\p                                          # Para placeholder
\tr                                         # Row Begin
\tc1 that agrees with godliness             # Column: GLQuote
\tc2 τῆς κατ’ εὐσέβειαν                     # Column: OrigQuote
\tc3                                        # Column: SupportReference
\tr                                         # Row Begin
\tc1-3 जो परमेश्वर को आदर देने के लिए उपयुक्त हो       # Merged-Column: OccurrenceNote
```
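The two constraints noted in the comments above (a single `\id` on the first line, and each `\c` occurring once per chapter) can be checked mechanically. The following is a minimal sketch, not part of the original notebook; `check_usfm_structure` is a hypothetical helper name, and it only inspects these two markers rather than validating USFM in general.

```python
import re

def check_usfm_structure(usfm_text):
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []
    lines = usfm_text.splitlines()

    # \id must be the first line and must occur exactly once
    id_lines = [ln for ln in lines if ln.startswith("\\id ")]
    if not lines or not lines[0].startswith("\\id "):
        problems.append("first line is not \\id")
    if len(id_lines) != 1:
        problems.append("expected exactly one \\id, found {0}".format(len(id_lines)))

    # Each chapter marker (\c N) should appear only once
    chapters = [ln.split()[1] for ln in lines if re.match(r"\\c \S+", ln)]
    repeated = sorted({c for c in chapters if chapters.count(c) > 1})
    if repeated:
        problems.append("chapter markers repeated: {0}".format(", ".join(repeated)))

    return problems

sample = "\\id TIT\n\\c 1\n\\p\n\\v 1\n\\p\n\\tr\n\\tc1 Son of God"
print(check_usfm_structure(sample))  # []
```

Running such a check on the converter's result is a quick way to see whether the generated text follows the layout shown above.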


# Implementation

### Import Python Packages Here

In [0]:
import io
import pandas as pd
import requests

### Program Configurations

In [0]:
col_dtypes = {
    "Book": "category",
    "Chapter": "category",
    "Verse": "category",
    "SupportReference": "object",
    "OrigQuote": "object",
    "GLQuote": "object",
    "OccurrenceNote": "object"
}
columns = list(col_dtypes.keys())
group_by_cols = ["Book", "Chapter", "Verse"]

sep="\t"

src_path = "https://git.door43.org/Door43-Catalog/hi_tn/raw/branch/master/"
src_files = ["hi_tn_42-MRK.tsv", "hi_tn_48-2CO.tsv", "hi_tn_49-GAL.tsv", "hi_tn_57-TIT.tsv", "hi_tn_58-PHM.tsv",
             "hi_tn_61-1PE.tsv", "hi_tn_63-1JN.tsv", "hi_tn_64-2JN.tsv", "hi_tn_65-3JN.tsv", "hi_tn_66-JUD.tsv"]

### Function (_df_to_usfm_) to Convert _pandas_ DataFrame to USFM format

In [0]:
def gdf_to_usfm(gdf):
  """Render one (Book, Chapter, Verse) group as a USFM fragment."""
  df = gdf.reset_index(drop=True)
  if df.empty:
    # Nothing to render; this also keeps head from being referenced before assignment
    return ""

  usfm_head = "\\id {0}\n\\c {1}\n\\p\n\\v {2}\n\\p"
  usfm_each = "\\tr\n\\tc1 {0}\n\\tc2 {1}\n\\tc3 {2}\n\\tr\n\\tc1-3 {3}"

  # Book, Chapter and Verse are constant within a group, so the header uses the first row
  first = df.iloc[0]
  head = usfm_head.format(first["Book"], first["Chapter"], first["Verse"])

  # One two-row table per note
  body = [
    usfm_each.format(r["GLQuote"], r["OrigQuote"], r["SupportReference"], r["OccurrenceNote"])
    for _, r in df.iterrows()
  ]

  return "{0}\n{1}\n".format(head, "\n\\p\n".join(body))

def df_to_usfm(df, sep_group=True):
  """Convert the whole DataFrame to USFM. When sep_group is True, verse groups are separated by an extra newline for readability."""
  # observed=True skips Book/Chapter/Verse category combinations that have no rows
  groups = df.groupby(group_by_cols, observed=True)
  # Iterating over the groups hands each one to gdf_to_usfm with all of its columns
  return ('\n' if sep_group else '').join(gdf_to_usfm(gdf) for _, gdf in groups)
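The Expected output section permits only one `\id` per output and one `\c` per chapter, whereas the header template above contains both markers for every verse group. The sketch below shows one possible way to emit each marker only when its value changes. It is an illustration under assumptions, not the notebook's solution: `book_to_usfm` and `ROW_TEMPLATE` are hypothetical names, the rows are assumed to be in reading order and to belong to a single book, and the note texts in the sample are placeholders.

```python
import pandas as pd

# Same two-row table layout as in the expected output
ROW_TEMPLATE = "\\tr\n\\tc1 {0}\n\\tc2 {1}\n\\tc3 {2}\n\\tr\n\\tc1-3 {3}"

def book_to_usfm(df):
    """Render notes in row order, writing \\id, \\c and \\v only when they change."""
    lines = []
    prev = (None, None, None)
    for _, r in df.iterrows():
        key = (r["Book"], r["Chapter"], r["Verse"])
        if key[0] != prev[0]:
            lines.append("\\id {0}".format(key[0]))   # once per book
        if key[:2] != prev[:2]:
            lines.append("\\c {0}".format(key[1]))    # once per chapter
            lines.append("\\p")
        if key != prev:
            lines.append("\\v {0}".format(key[2]))    # once per verse
        lines.append("\\p")                           # paragraph before each note table
        lines.append(ROW_TEMPLATE.format(
            r["GLQuote"], r["OrigQuote"], r["SupportReference"], r["OccurrenceNote"]))
        prev = key
    return "\n".join(lines) + "\n"

# Small sample; the OccurrenceNote values are placeholders
notes = pd.DataFrame(
    [
        ["TIT", "1", "1", "Son of God", "Υἱοῦ Θεοῦ", "guidelines-sonofgodprinciples", "first note"],
        ["TIT", "1", "1", "that agrees with godliness", "τῆς κατ’ εὐσέβειαν", "", "second note"],
        ["TIT", "1", "2", "before all the ages of time", "πρὸ χρόνων αἰωνίων", "", "third note"],
    ],
    columns=["Book", "Chapter", "Verse", "GLQuote", "OrigQuote", "SupportReference", "OccurrenceNote"],
)
usfm = book_to_usfm(notes)
print(usfm)
```

Walking the rows in file order avoids relying on how grouped keys are sorted, at the cost of requiring the input to be ordered already.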

### Iterate through each file and apply the function

In [0]:
for src_file in src_files[3:4]:  # Test on Titus first; once the code works, remove the slice to convert every file.
    # Fetch the TSV data from the URL
    response = requests.get(src_path + src_file)
    response.raise_for_status()  # Stop early if the download failed
    # Load it into a pandas DataFrame
    tnotes = pd.read_csv(io.StringIO(response.content.decode('utf-8')), delimiter=sep)
    
    # Filter with the given columns
    tnotes = tnotes[columns]
    # Fill NaN with empty string
    tnotes = tnotes.fillna("")
    # Enforce the given Col datatypes for better performance and data stability
    tnotes = tnotes.astype(col_dtypes)

    # # Convert tnotes dataframe to usfm data
    # tnotes_usfm = df_to_usfm(tnotes, sep_group=True)
    # # Save the usfm data to the following file path
    # save_file = "{0}.usfm".format(src_file.split('.')[0])
    # with open(save_file, "w", encoding="utf-8") as f:
    #   f.write(tnotes_usfm)
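The loading steps in the loop (read the TSV, keep the configured columns, fill missing values, apply the dtypes) can be tried without network access on a small in-memory sample. This is a sketch under assumptions: the quotes come from the Expected output section, while the `ID` column, its values, and the note texts are made up here purely to show that extra columns are dropped.

```python
import io
import pandas as pd

col_dtypes = {
    "Book": "category",
    "Chapter": "category",
    "Verse": "category",
    "SupportReference": "object",
    "OrigQuote": "object",
    "GLQuote": "object",
    "OccurrenceNote": "object",
}

# Hypothetical two-row sample; "ID" stands in for any column not listed in col_dtypes
sample_tsv = (
    "Book\tChapter\tVerse\tID\tSupportReference\tOrigQuote\tGLQuote\tOccurrenceNote\n"
    "TIT\t1\t1\tx001\tguidelines-sonofgodprinciples\tΥἱοῦ Θεοῦ\tSon of God\tfirst note\n"
    "TIT\t1\t1\tx002\t\tτῆς κατ’ εὐσέβειαν\tthat agrees with godliness\tsecond note\n"
)

tnotes = pd.read_csv(io.StringIO(sample_tsv), delimiter="\t")
tnotes = tnotes[list(col_dtypes)]   # keep only the configured columns
tnotes = tnotes.fillna("")          # an empty SupportReference becomes ""
tnotes = tnotes.astype(col_dtypes)  # enforce the configured dtypes
print(tnotes.dtypes)
```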

# Solution Sample Output

### DataFrame column types

In [0]:
tnotes.dtypes

Book                category
Chapter             category
Verse               category
SupportReference      object
OrigQuote             object
GLQuote               object
OccurrenceNote        object
dtype: object

### Grouped by Book, Chapter and Verse

In [7]:
tnotes.groupby(group_by_cols).describe().head(20)

| Book | Chapter | Verse | SupportReference count | SupportReference unique | SupportReference top | SupportReference freq | OrigQuote count | OrigQuote unique | OrigQuote top | OrigQuote freq | GLQuote count | GLQuote unique | GLQuote top | GLQuote freq | OccurrenceNote count | OccurrenceNote unique | OccurrenceNote top | OccurrenceNote freq |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TIT | 1 | 1 | 2 | 1 |  | 2 | 2 | 2 | κατὰ πίστιν | 1 | 2 | 2 | that agrees with godliness | 1 | 2 | 2 | विश्वास को मजबूत करने क लिए | 1 |
| TIT | 1 | 10 | 4 | 3 |  | 2 | 4 | 4 | οἱ ἐκ τῆς περιτομῆς | 1 | 4 | 4 | those of the circumcision | 1 | 4 | 4 | यह यहूदी मसीहियों के सन्दर्भ में है जो यह सिखा... | 1 |
| TIT | 1 | 11 | 4 | 1 |  | 4 | 4 | 4 | οὓς δεῖ ἐπιστομίζειν | 1 | 4 | 4 | what they should not teach | 1 | 4 | 4 | समस्त परिवारों को बर्बाद कर देते हैं। मुद्दा य... | 1 |
| TIT | 1 | 12 | 3 | 3 | figs-hyperbole | 1 | 3 | 3 | κακὰ θηρία | 1 | 3 | 3 | One of their own prophets | 1 | 3 | 3 | यह रूपक क्रेती लोगों की तुलना जंगली पशुओं से क... | 1 |
| TIT | 1 | 13 | 2 | 1 |  | 2 | 2 | 2 | δι’ ἣν αἰτίαν ἔλεγχε αὐτοὺς ἀποτόμως | 1 | 2 | 2 | so that they may be sound in the faith | 1 | 2 | 2 | ताकि उनके पास स्वस्थ विश्वास हो या “ताकि उनका ... | 1 |
| TIT | 1 | 14 | 2 | 2 | figs-metaphor | 1 | 2 | 2 | Ἰουδαϊκοῖς μύθοις | 1 | 2 | 2 | Jewish myths | 1 | 2 | 2 | यह यहूदियों की झूठी शिक्षा के सन्दर्भ में है। | 1 |
| TIT | 1 | 15 | 3 | 2 |  | 2 | 3 | 3 | τοῖς μεμιαμμένοις καὶ ἀπίστοις, οὐδὲν καθαρόν | 1 | 3 | 3 | To those who are pure, all things are pure | 1 | 3 | 3 | यदि लोग अन्दर से शुद्ध हैं, तो जो कुछ भी वे कर... | 1 |
| TIT | 1 | 16 | 2 | 1 |  | 2 | 2 | 2 | βδελυκτοὶ ὄντες | 1 | 2 | 2 | they deny him by their actions | 1 | 2 | 2 | जिस तरह से वे जीते हैं उससे सिद्ध होता है कि व... | 1 |
| TIT | 1 | 2 | 1 | 1 |  | 1 | 1 | 1 | πρὸ χρόνων αἰωνίων | 1 | 1 | 1 | before all the ages of time | 1 | 1 | 1 | समय के प्रारम्भ से पहले | 1 |
| TIT | 1 | 3 | 4 | 2 |  | 3 | 4 | 4 | καιροῖς ἰδίοις | 1 | 4 | 4 | At the right time | 1 | 4 | 4 | उसने मुझ पर भरोसा किया कि मैं आगे ले जाऊं या “... | 1 |

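In the grouped output above, the verses are listed as 1, 10, 11, ..., 16, 2, 3, which is text order rather than numeric order. If the USFM should follow reading order, a sort key along the following lines could be applied to the labels. This is only a sketch: `verse_sort_key` is a hypothetical helper, and the `"intro"` label is an assumed example of a non-numeric value, not something shown in the data above.

```python
def verse_sort_key(label):
    """Sort numeric labels by value; place non-numeric labels first, alphabetically."""
    text = str(label)
    if text.isdigit():
        return (1, int(text), "")
    return (0, 0, text)

verses = ["1", "10", "11", "2", "3", "intro"]
print(sorted(verses, key=verse_sort_key))  # ['intro', '1', '2', '3', '10', '11']
```

The same key works for chapter labels, since it only depends on whether the label consists of digits.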

### _df_to_usfm_ Output

In [8]:
df_to_usfm(tnotes, sep_group=True)    # This function can be used in the loop above where we read the data from input tsv files

