## Description
In this file, I will split variable names into terms. My first approach is to identify the naming convention of each variable name, and then split the name according to that convention.  
Potential problems are mixed-style names such as "get_numFish" and names with no recognizable style such as "filenames". This is known as the identifier-splitting problem, and we will come back to address it later.  
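
To make the two problem cases concrete, here is a minimal sketch of a convention-based splitter. The helper `naive_split` is hypothetical and is not the splitter used later in this file; it only shows that splitting on underscores and case boundaries can cope with a mixed-style name but leaves a no-style name as one term.

```python
import re

def naive_split(name):
    """Hypothetical helper: split on underscores, then on case boundaries."""
    terms = []
    for part in name.split('_'):
        if part:
            # an optional capital followed by lower-case letters, or a run of capitals
            terms.extend(re.findall(r'[A-Z]?[a-z]+|[A-Z]+', part))
    return terms

print(naive_split("get_numFish"))  # ['get', 'num', 'Fish']
print(naive_split("filenames"))    # ['filenames']
```

Splitting "filenames" into "file" and "names" needs knowledge about words, which no rule based on case or underscores can provide.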

## Read Data
First, let's read the data from my previous work. That work takes projects from GitHub, labels each one as Chinese Author or English Author, and finds all Python files in these projects.  
Then I extracted all the function and variable names from the Python files and stored them in a database, in a table named "NameTable".
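
The extraction code itself is not part of this file. As a rough sketch of how such names could be collected, assuming the standard-library `ast` module is used, the hypothetical helper `extract_names` below walks the syntax tree of a Python source string and records definitions and assignment targets. The kind labels mirror the `nameType` values seen in the table, but the real pipeline may work differently.

```python
import ast

def extract_names(source):
    """Hypothetical helper: collect (name, kind) pairs from Python source text."""
    names = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            names.append((node.name, "function"))
        elif isinstance(node, ast.ClassDef):
            names.append((node.name, "class"))
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            # a Name in Store context is an assignment target, i.e. a variable
            names.append((node.id, "variable"))
    return names

sample = (
    "def _load_yaml(path):\n"
    "    numFish = 3\n"
    "    return numFish\n"
    "\n"
    "class Node:\n"
    "    pass\n"
)
print(extract_names(sample))
```

This sketch ignores function parameters and attribute assignments, which a full extractor would probably also need to handle.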

In [1]:
import pandas as pd
import sqlite3

In [2]:
name_table = "NameTable"
# sqlite3.connect resolves a relative path against the current working directory (the folder the script is run from).
conn = sqlite3.connect('data.db')
query = f"SELECT * FROM {name_table}"
df = pd.read_sql_query(query, conn)

In [3]:
df.head()

Unnamed: 0,id,name,nameType,nameScope,projectSize,authorName,authorProficiency,authorLocation
0,0,_raise_err,function,GlobalScope,72400,programthink,<50,China
1,1,_load_yaml,function,GlobalScope,72400,programthink,<50,China
2,2,Node,class,GlobalScope,72400,programthink,<50,China
3,3,Relation,class,GlobalScope,72400,programthink,<50,China
4,4,Family,class,GlobalScope,72400,programthink,<50,China


Hazard: "VCdimension". This is not a good name, but someone may still name a variable this way, and a case-based split cannot tell where "VC" ends.  
Hazard: "Sovits". This is a proper name.
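
The sketch below illustrates why these names are hazards. It assumes case-boundary rules of the kind used in this file (a run of capitals, a capital followed by lower-case letters, and a leading lower-case word); `case_terms` is a hypothetical name used only for this illustration.

```python
import re

UPPER_RUN = r'[A-Z]+(?![a-z])'  # capitals not followed by a lower-case letter
UPPER_WORD = r'[A-Z][a-z]+'     # one capital followed by lower-case letters

def case_terms(name):
    """Hypothetical helper: split a letters-only name on case boundaries."""
    return re.findall(f'{UPPER_RUN}|{UPPER_WORD}|^[a-z]+', name)

print(case_terms("VCdimension"))  # ['V', 'Cdimension']
print(case_terms("Sovits"))       # ['Sovits']
```

The acronym "VC" is cut in the wrong place because its last capital looks like the start of the next word, while "Sovits" is kept whole, so nothing in the output shows that it is a proper name.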


In [14]:
import re

# One pattern that yields the terms in the order they appear in the name:
#   [A-Z]+(?![a-z])  "UPPER" form: a run of capitals. The negative lookahead drops the
#                    last capital when it is followed by a lower-case letter, because
#                    that capital belongs to the next word.
#   [A-Z][a-z]+      "Upper" form: a capital followed by lower-case letters
#                    (until the next capital).
#   ^[a-z]+          a leading lower-case term.
TERM_PATTERN = re.compile(r'[A-Z]+(?![a-z])|[A-Z][a-z]+|^[a-z]+')

def split_only_letter_name(name):
    # assume that name contains only letters, in both upper and lower case
    return TERM_PATTERN.findall(name)


def split_names(name):
    # We assume the input is a valid Python name made of ASCII letters, digits and underscores.
    # Non-ASCII letters match none of the patterns and are skipped.
    # Split by underscores and digits, so that each remaining part contains only letters.
    parts = re.split(r'[_\d]+', name)
    terms = []
    for term in parts:
        if term == '':
            continue
        # if the part is all lower case or all upper case, keep it as a single term
        if re.fullmatch(r'[a-z]+', term) or re.fullmatch(r'[A-Z]+', term):
            terms.append(term)
            continue
        # todo handle "gamma"
        terms += split_only_letter_name(term)

    return terms


0          [raise, err]
1          [load, yaml]
2                [Node]
3            [Relation]
4              [Family]
               ...     
8234210          [long]
8234211      [getslice]
8234212      [setslice]
8234213      [delslice]
8234214       [factory]
Name: name, Length: 8234215, dtype: object

Now we will apply the split function to the dataframe.

In [15]:
df['terms'] = df['name'].apply(split_names)


In [16]:
df

Unnamed: 0,id,name,nameType,nameScope,projectSize,authorName,authorProficiency,authorLocation,terms
0,0,_raise_err,function,GlobalScope,72400,programthink,<50,China,"[raise, err]"
1,1,_load_yaml,function,GlobalScope,72400,programthink,<50,China,"[load, yaml]"
2,2,Node,class,GlobalScope,72400,programthink,<50,China,[Node]
3,3,Relation,class,GlobalScope,72400,programthink,<50,China,[Relation]
4,4,Family,class,GlobalScope,72400,programthink,<50,China,[Family]
...,...,...,...,...,...,...,...,...,...
8234210,8234210,__long__,variable,FunctionScope,10460,juvers,>100,USA,[long]
8234211,8234211,__getslice__,variable,FunctionScope,10460,juvers,>100,USA,[getslice]
8234212,8234212,__setslice__,variable,FunctionScope,10460,juvers,>100,USA,[setslice]
8234213,8234213,__delslice__,variable,FunctionScope,10460,juvers,>100,USA,[delslice]


Now we will identify the naming convention of each variable

In [18]:
def identify_naming_convention(row):
    # Ignoring digits, a name is built from three kinds of terms (lower, Upper, UPPER),
    # with or without underscores:
    # one kind of term:
    #   lower, no underscore - Unknown;  lower, underscore - Snake
    #   Upper, no underscore - Pascal;   Upper, underscore - Mixed
    #   UPPER - Snake-Screaming;         a single UPPER letter, no underscore - Unknown
    # two kinds of terms:
    #   lower + Upper, no underscore - Camel;  lower + Upper, underscore - Mixed
    #   lower + UPPER - non-convention (with or without underscore)
    #   Upper + UPPER - non-convention (with or without underscore)
    # three kinds of terms: non-convention
    name = row['name']
    terms = row['terms']
    # names such as "_" produce no terms at all, so no convention can be identified
    if not terms:
        return 'Unknown'
    # if all the terms are lower case, or the name is a single upper-case letter, then we don't know
    if '_' not in name and all(term.islower() for term in terms):
        return 'Unknown'
    if '_' not in name and (len(terms) == 1 and re.match(r'^[A-Z]$', terms[0])):
        return 'Unknown'
    # if the name is UPPER_UPPER, or just UPPER
    if all(term.isupper() for term in terms):
        return "Snake-Screaming"
    # lower X  underscore
    if '_' in name and all(term.islower() for term in terms):
        return "Snake"
    # lower Upper X no underscore - Camel
    if '_' not in name and all(re.match(r'^[a-zA-Z][a-z]*$', term) for term in terms):
        # if name is lowerUpper
        if name[0].islower():
            return "Camel"
        # if the name is UpperUpper
        else:
            return "Pascal"
    # if the name is lower_lowerUpper
    if '_' in name and all(re.match(r'^[a-zA-Z][a-z]*$', term) for term in terms):
        return "Mixed"
    # Everything left mixes UPPER terms with lower or Upper terms;
    # lowerUpper and lower_Upper were already handled above.
    # What remains are names such as lowerUPPER, lower_UPPER and Upper_UPPER.
    return "non-convention"

df['namingConvention'] = df.apply(identify_naming_convention, axis=1)


In [19]:
df

Unnamed: 0,id,name,nameType,nameScope,projectSize,authorName,authorProficiency,authorLocation,terms,namingConvention
0,0,_raise_err,function,GlobalScope,72400,programthink,<50,China,"[raise, err]",Snake
1,1,_load_yaml,function,GlobalScope,72400,programthink,<50,China,"[load, yaml]",Snake
2,2,Node,class,GlobalScope,72400,programthink,<50,China,[Node],Pascal
3,3,Relation,class,GlobalScope,72400,programthink,<50,China,[Relation],Pascal
4,4,Family,class,GlobalScope,72400,programthink,<50,China,[Family],Pascal
...,...,...,...,...,...,...,...,...,...,...
8234210,8234210,__long__,variable,FunctionScope,10460,juvers,>100,USA,[long],Snake
8234211,8234211,__getslice__,variable,FunctionScope,10460,juvers,>100,USA,[getslice],Snake
8234212,8234212,__setslice__,variable,FunctionScope,10460,juvers,>100,USA,[setslice],Snake
8234213,8234213,__delslice__,variable,FunctionScope,10460,juvers,>100,USA,[delslice],Snake


Store the result in the database, replacing the existing table.

In [20]:
import json
df['terms'] = df['terms'].apply(json.dumps)
df

Unnamed: 0,id,name,nameType,nameScope,projectSize,authorName,authorProficiency,authorLocation,terms,namingConvention
0,0,_raise_err,function,GlobalScope,72400,programthink,<50,China,"[""raise"", ""err""]",Snake
1,1,_load_yaml,function,GlobalScope,72400,programthink,<50,China,"[""load"", ""yaml""]",Snake
2,2,Node,class,GlobalScope,72400,programthink,<50,China,"[""Node""]",Pascal
3,3,Relation,class,GlobalScope,72400,programthink,<50,China,"[""Relation""]",Pascal
4,4,Family,class,GlobalScope,72400,programthink,<50,China,"[""Family""]",Pascal
...,...,...,...,...,...,...,...,...,...,...
8234210,8234210,__long__,variable,FunctionScope,10460,juvers,>100,USA,"[""long""]",Snake
8234211,8234211,__getslice__,variable,FunctionScope,10460,juvers,>100,USA,"[""getslice""]",Snake
8234212,8234212,__setslice__,variable,FunctionScope,10460,juvers,>100,USA,"[""setslice""]",Snake
8234213,8234213,__delslice__,variable,FunctionScope,10460,juvers,>100,USA,"[""delslice""]",Snake


In [21]:
df.to_sql(name_table, conn, index=False, if_exists='replace')
conn.commit()
conn.close()

Let's check the result

In [22]:
conn = sqlite3.connect('data.db')
query = f"SELECT * FROM {name_table}"
df = pd.read_sql_query(query, conn)

In [23]:
df

Unnamed: 0,id,name,nameType,nameScope,projectSize,authorName,authorProficiency,authorLocation,terms,namingConvention
0,0,_raise_err,function,GlobalScope,72400,programthink,<50,China,"[""raise"", ""err""]",Snake
1,1,_load_yaml,function,GlobalScope,72400,programthink,<50,China,"[""load"", ""yaml""]",Snake
2,2,Node,class,GlobalScope,72400,programthink,<50,China,"[""Node""]",Pascal
3,3,Relation,class,GlobalScope,72400,programthink,<50,China,"[""Relation""]",Pascal
4,4,Family,class,GlobalScope,72400,programthink,<50,China,"[""Family""]",Pascal
...,...,...,...,...,...,...,...,...,...,...
8234210,8234210,__long__,variable,FunctionScope,10460,juvers,>100,USA,"[""long""]",Snake
8234211,8234211,__getslice__,variable,FunctionScope,10460,juvers,>100,USA,"[""getslice""]",Snake
8234212,8234212,__setslice__,variable,FunctionScope,10460,juvers,>100,USA,"[""setslice""]",Snake
8234213,8234213,__delslice__,variable,FunctionScope,10460,juvers,>100,USA,"[""delslice""]",Snake
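
Note that the `terms` column now comes back as JSON text rather than as Python lists. The sketch below shows the round trip on a throwaway in-memory database, with a reduced two-column table that is only an assumption for this example.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database for this sketch
conn.execute("CREATE TABLE NameTable (name TEXT, terms TEXT)")
conn.execute(
    "INSERT INTO NameTable VALUES (?, ?)",
    ("_raise_err", json.dumps(["raise", "err"])),
)
conn.commit()

name, raw_terms = conn.execute("SELECT name, terms FROM NameTable").fetchone()
terms = json.loads(raw_terms)  # decode the JSON text back into a list
conn.close()

print(type(raw_terms).__name__, terms)  # str ['raise', 'err']
```

On the dataframe read above, the same decoding would be `df['terms'].apply(json.loads)`.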
