stats on fishy endings #178

Open
drdhaval2785 opened this issue Dec 2, 2015 · 2 comments

@drdhaval2785

As requested by @gasyoun, the following is a list of headword endings that each occur in fewer than 50 headwords in sanhw1.txt. The list is not sorted in any way.
It is just a rough list to play with.

How to use?

  1. Open sanhw1.txt in Notepad++.
  2. Use a regex search for an ending followed by a colon, e.g. 'MS:', across the whole document (Find All in Current Document).
  3. Examine the matches to find errors.
    Abnormal endings #177 was found by the same method.
[(32, u'MS'), (49, u'Ak'), (31, u'Mh'), (27, u'aN'), (7, u'aY'), (6, u'ot'), (31
, u'fc'), (37, u'qe'), (32, u'Re'), (23, u'go'), (21, u'Im'), (35, u'Mc'), (2, u
'Mn'), (46, u'Qi'), (27, u'yU'), (13, u'qU'), (25, u'zU'), (11, u'mU'), (7, u'nf
'), (12, u'Ku'), (7, u'Ag'), (37, u'ag'), (10, u'dy'), (17, u'tO'), (7, u'RU'),
(27, u'nU'), (6, u'mO'), (1, u'ID'), (32, u'rO'), (25, u'yO'), (10, u'RO'), (10,
 u'eH'), (22, u'nO'), (35, u'im'), (9, u'It'), (19, u'rc'), (19, u'gU'), (3, u'o
i'), (15, u'aG'), (13, u'Gi'), (28, u'Gu'), (3, u'Go'), (13, u'oH'), (2, u'Ni'),
 (5, u'NI'), (14, u'Mk'), (30, u'Nk'), (31, u'Uy'), (4, u'MK'), (21, u'NK'), (17
, u'Mg'), (33, u'Ng'), (10, u'MG'), (21, u'NG'), (39, u'iw'), (11, u'iR'), (18,
u'ia'), (40, u'Rq'), (1, u'ai'), (6, u'Ai'), (42, u'iv'), (33, u'nv'), (45, u'rE
'), (7, u'Ce'), (2, u'Ha'), (39, u'Iv'), (34, u'jU'), (20, u'Ja'), (2, u'Ji'), (
20, u'YI'), (8, u'cU'), (1, u'Ug'), (14, u'YC'), (43, u'Mj'), (10, u'So'), (13,
u'Wu'), (9, u'La'), (1, u'OW'), (25, u'ww'), (26, u'Al'), (22, u'aW'), (4, u'uA'
), (45, u'Wi'), (32, u'aq'), (4, u'Om'), (44, u'bi'), (4, u'qq'), (11, u'ig'), (
7, u'iY'), (21, u'RW'), (11, u'MW'), (1, u'Ea'), (46, u'ik'), (4, u'tT'), (39, u
'mp'), (49, u'rd'), (43, u'me'), (38, u'uS'), (17, u'rh'), (37, u'gE'), (11, u'i
N'), (33, u'zw'), (19, u'we'), (8, u'jf'), (4, u'fH'), (39, u'tF'), (20, u'Ip'),
 (44, u'Av'), (3, u'cf'), (47, u'nD'), (33, u'ev'), (49, u'Iq'), (36, u'Es'), (1
3, u'ed'), (8, u'iq'), (28, u'pf'), (22, u'pF'), (6, u'lB'), (20, u'ge'), (35, u
'cC'), (19, u'aC'), (38, u'ez'), (49, u'AD'), (41, u'fh'), (26, u'Az'), (44, u'U
z'), (28, u'Bf'), (8, u'sj'), (34, u'nT'), (31, u'jj'), (21, u'rz'), (10, u'lg')
, (24, u'ts'), (44, u'De'), (41, u'uq'), (5, u'vF'), (18, u'ep'), (19, u'Qf'), (
28, u'ro'), (29, u'Il'), (21, u'el'), (3, u'Id'), (38, u'pe'), (20, u'rD'), (23,
 u'mf'), (2, u'IR'), (20, u'Iz'), (10, u'fn'), (38, u'to'), (6, u'Na'), (22, u'd
f'), (31, u'Cu'), (29, u'Uh'), (11, u'oh'), (23, u'Te'), (16, u'et'), (13, u'iM'
), (37, u'mu'), (32, u'ry'), (3, u'To'), (21, u'so'), (48, u'sy'), (42, u'Ap'),
(18, u'uN'), (14, u'dU'), (4, u'eN'), (23, u'do'), (1, u'dw'), (1, u'dq'), (2, u
'Ot'), (43, u'uw'), (8, u'nE'), (11, u'gf'), (25, u'yo'), (43, u'je'), (8, u'Ke'
), (12, u'PA'), (14, u'uW'), (37, u'rR'), (7, u'Ik'), (23, u'Ir'), (5, u'IS'), (
4, u'Do'), (19, u'ok'), (36, u'kF'), (6, u'eD'), (1, u'En'), (15, u'GI'), (3, u'
hy'), (20, u'ce'), (45, u'ke'), (14, u'In'), (11, u'AN'), (18, u'Aq'), (34, u'ne
'), (1, u'tk'), (12, u'EH'), (15, u'Mp'), (19, u'rt'), (30, u'Uj'), (11, u'Ft'),
 (18, u'xp'), (38, u'Md'), (20, u'gF'), (7, u'Mt'), (6, u'Co'), (12, u'lp'), (17
, u'ns'), (8, u'jF'), (7, u'uY'), (14, u'rk'), (12, u'fB'), (14, u'dF'), (4, u'A
T'), (1, u'Lp'), (26, u'iK'), (12, u'zo'), (2, u'Sy'), (9, u'AR'), (11, u'rT'),
(20, u'MD'), (4, u'rg'), (11, u'Ij'), (15, u'Sc'), (15, u'en'), (23, u'll'), (21
, u'yf'), (9, u'Sf'), (8, u'ug'), (7, u'ub'), (16, u'tv'), (28, u'rC'), (1, u'UK
'), (6, u'UN'), (5, u'jJ'), (12, u'Ut'), (10, u'Ud'), (16, u'Yu'), (22, u'no'),
(4, u'nc'), (2, u'nj'), (2, u'ec'), (10, u'iy'), (28, u'av'), (27, u'Be'), (9, u
'oc'), (1, u'nh'), (3, u'af'), (12, u'ej'), (31, u'az'), (8, u'zE'), (17, u'uT')
, (14, u'Is'), (6, u'Un'), (1, u'vr'), (1, u'UB'), (14, u'SF'), (2, u'oB'), (8,
u'oz'), (8, u'Yi'), (7, u'ab'), (9, u'bj'), (3, u'MH'), (10, u'jO'), (10, u'Gf')
, (6, u'JA'), (2, u'Ju'), (6, u'fs'), (8, u'UR'), (11, u'We'), (15, u'er'), (4,
u'iT'), (6, u'fq'), (28, u'Se'), (7, u'iB'), (3, u'DO'), (20, u'kE'), (13, u'Us'
), (5, u'Uc'), (13, u'ps'), (4, u'oj'), (5, u'pE'), (9, u'fC'), (5, u'Br'), (1,
u'rw'), (19, u'Mb'), (23, u'MB'), (6, u'Bo'), (8, u'lo'), (5, u'dE'), (5, u'ko')
, (4, u'rG'), (10, u'Ro'), (4, u'dO'), (2, u'dd'), (10, u'rp'), (5, u'rP'), (16,
 u'rb'), (35, u'rv'), (7, u'rS'), (32, u'bU'), (25, u'il'), (12, u'Ul'), (2, u'M
Q'), (28, u'Mq'), (8, u'es'), (1, u'Ls'), (3, u'iP'), (2, u'sE'), (3, u'rm'), (1
3, u'zy'), (13, u'SU'), (4, u'hO'), (4, u'Sv'), (16, u'CI'), (7, u'vO'), (18, u'
QI'), (3, u'Et'), (5, u'kO'), (21, u'lU'), (11, u'fg'), (10, u'fN'), (10, u'sO')
, (7, u'ho'), (4, u'AI'), (2, u'Au'), (1, u'Af'), (10, u'kU'), (7, u'IM'), (4, u
'tt'), (21, u'Ur'), (4, u'AC'), (13, u'Ci'), (3, u'MC'), (3, u'cE'), (31, u'ul')
, (1, u'Ge'), (8, u'be'), (11, u'ol'), (4, u'bd'), (3, u'px'), (5, u'po'), (10,
u'eq'), (15, u'tU'), (1, u'Ek'), (17, u'lE'), (9, u'co'), (1, u'nM'), (8, u'lO')
, (2, u'LA'), (1, u'Mv'), (5, u'iC'), (2, u'op'), (1, u'mv'), (3, u'jy'), (35, u
'vu'), (2, u'jE'), (19, u'Ry'), (9, u'Dy'), (1, u'IL'), (1, u'IK'), (14, u'IN'),
 (3, u'Ih'), (11, u'vo'), (2, u'uK'), (11, u'on'), (6, u'Er'), (5, u'uC'), (8, u
'CU'), (1, u'YJ'), (5, u'pO'), (12, u'ny'), (3, u'Ic'), (1, u'dg'), (10, u'MT'),
 (1, u'dj'), (1, u'dJ'), (1, u'wF'), (1, u'yM'), (1, u'aF'), (7, u'Ok'), (4, u'd
v'), (1, u'ds'), (4, u'zO'), (14, u'or'), (4, u'mF'), (5, u'lh'), (3, u'lP'), (3
, u'ED'), (2, u'od'), (4, u'BO'), (2, u'UM'), (1, u'M~'), (1, u'UW'), (4, u'Um')
, (31, u'mE'), (2, u'U~'), (11, u'fR'), (3, u'fP'), (9, u'mP'), (1, u'FH'), (27,
 u'x'), (1, u'xN'), (1, u'xw'), (1, u'X'), (1, u'XH'), (5, u'Ew'), (3, u'gO'), (
4, u'eW'), (3, u'wE'), (4, u'ey'), (1, u'em'), (13, u'zf'), (5, u'eh'), (1, u'oM
'), (1, u'oK'), (7, u'Kf'), (8, u'jo'), (5, u'oR'), (7, u'Rf'), (5, u'om'), (1,
u'lj'), (1, u'o~'), (12, u'OH'), (1, u'O~'), (12, u'un'), (15, u'kk'), (2, u'kK'
), (9, u'aK'), (17, u'Rw'), (11, u'Mw'), (5, u'uy'), (2, u'fT'), (2, u'PI'), (4,
 u'bf'), (5, u'Pu'), (9, u'vy'), (3, u'TO'), (1, u'Va'), (19, u'my'), (7, u'fw')
, (11, u'wy'), (8, u'zk'), (2, u'Iw'), (9, u'uR'), (6, u'dr'), (2, u'Nu'), (4, u
'ib'), (13, u'qE'), (2, u'By'), (2, u'sm'), (2, u'qo'), (1, u'Uw'), (8, u'Uq'),
(11, u'Up'), (1, u'Sm'), (6, u'fY'), (5, u'Rv'), (1, u'fb'), (1, u'fv'), (1, u'z
R'), (3, u'FY'), (2, u'kx'), (1, u'xb'), (10, u'ex'), (6, u'UY'), (1, u'z2'), (3
, u'Li'), (2, u'Lu'), (16, u'qf'), (1, u'cO'), (4, u'dD'), (3, u'Ib'), (7, u'eS'
), (32, u'ty'), (1, u'rH'), (4, u'ow'), (3, u'wU'), (1, u'Mr'), (1, u'qM'), (14,
 u'ew'), (3, u'rf'), (3, u'KE'), (2, u'Ko'), (1, u'ox'), (6, u'oq'), (2, u'gG'),
 (1, u'mx'), (5, u'bb'), (2, u'lv'), (7, u'uM'), (3, u'uP'), (1, u'Qe'), (1, u'z
W'), (1, u'OT'), (12, u'ek'), (4, u'Mz'), (1, u'sx'), (4, u'RR'), (1, u'MR'), (5
, u'iA'), (1, u'rn'), (3, u'A~'), (1, u'NO'), (1, u'KO'), (2, u'CO'), (2, u'aQ')
, (9, u'st'), (1, u'sh'), (2, u'hn'), (5, u'IB'), (1, u'fI'), (6, u'cy'), (2, u'
uQ'), (1, u'AU'), (1, u'Ye'), (1, u'YO'), (1, u'JJ'), (3, u'Qu'), (2, u'rJ'), (2
, u'Fz'), (1, u'og'), (11, u'J'), (2, u'vU'), (2, u'JI'), (1, u'Jf'), (1, u'JF')
, (3, u'IY'), (1, u'RQ'), (6, u'mo'), (3, u'rq'), (5, u'iG'), (1, u'au'), (2, u'
Ty'), (1, u'nk'), (4, u'PU'), (3, u'EN'), (3, u'TU'), (1, u'TE'), (3, u'rB'), (2
, u'AY'), (1, u'ck'), (2, u'Hk'), (2, u'HK'), (3, u'Ky'), (2, u'Hp'), (1, u'uG')
, (1, u'HI'), (1, u'zK'), (2, u'py'), (1, u'Ep'), (1, u'Or'), (1, u'OS'), (1, u'
Oz'), (1, u'Os'), (6, u'AK'), (8, u'AG'), (3, u'mm'), (1, u'ii'), (1, u'US'), (1
, u'DF'), (2, u'oI'), (4, u'sv'), (2, u'uv'), (2, u'RE'), (7, u'Tf'), (2, u'AB')
, (1, u'nF'), (2, u'ES'), (1, u'Ez'), (1, u'fA'), (1, u'tx'), (1, u'iu'), (1, u'
fM'), (2, u'wo'), (1, u'wr'), (1, u'Ab'), (1, u'qv'), (2, u'cc'), (1, u'iW'), (3
, u'zx'), (1, u'zp'), (4, u'eR'), (4, u'eb'), (3, u'ER'), (3, u'pS'), (1, u'Le')
, (1, u'oT'), (2, u'Pi'), (1, u'Pe'), (3, u'hl'), (2, u'hv'), (2, u'bF'), (1, u'
bo'), (1, u'hm'), (1, u'4n'), (1, u'WO'), (1, u'gv'), (1, u'BF'), (1, u'nS'), (2
, u'aU'), (1, u'aI'), (2, u'Dv'), (5, u'sk'), (1, u'eM'), (4, u'Sr'), (4, u'sr')
, (1, u'Ig'), (1, u'cx'), (1, u'fL'), (1, u'gy'), (5, u'fl'), (1, u'eG'), (2, u'
eT'), (1, u'eC'), (3, u'Ow'), (6, u'Oq'), (1, u'aP'), (3, u'tE'), (1, u'MP'), (2
, u'ee'), (1, u'Ia'), (1, u'ui'), (2, u'eL'), (1, u'eB'), (1, u'Ey'), (1, u'GU')
, (1, u'Gv'), (3, u'vv'), (2, u'Wy'), (1, u'RT'), (1, u'lf'), (1, u'lF'), (1, u'
eK'), (1, u'qQ'), (1, u'Je'), (2, u'SE'), (4, u'lk'), (2, u'yv'), (3, u'dx'), (3
, u'eY'), (1, u'Ml'), (1, u'Em'), (2, u'SO'), (1, u'ly'), (1, u'lb'), (1, u'aL')
, (2, u'Fh'), (1, u'2a'), (1, u'ks'), (3, u'ng'), (2, u'Wf'), (1, u'wO'), (1, u'
HU'), (1, u'Uk'), (1, u'sF'), (1, u'Mm'), (1, u'ao'), (1, u'eu'), (1, u'yy'), (1
, u'iL'), (1, u'ss'), (1, u'fW'), (1, u'hF'), (1, u'eQ'), (2, u'hE'), (1, u'oQ')
, (1, u'IC')]
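
The manual Notepad++ search described under "How to use" could also be scripted. The following is only a minimal sketch. It assumes that each line of sanhw1.txt has the form `headword:DICT1,DICT2,...` (a format inferred from the regex quoted in this thread, not checked against the file). The function name `find_by_ending` and the sample lines are hypothetical.

```python
def find_by_ending(lines, ending):
    """Return headwords (the text before the first ':') that end with `ending`.

    Under the assumed line format this is equivalent to searching
    for `ending + ':'`, e.g. 'MS:'.
    """
    hits = []
    for line in lines:
        headword, sep, _dicts = line.strip().partition(":")
        # Skip lines without a ':' separator; they do not fit the assumed format.
        if sep and headword.endswith(ending):
            hits.append(headword)
    return hits


# Made-up sample lines, for illustration only.
sample = ["aMS:MW", "devaH:MW,PW", "vaMS:PWG,MW", "rAma:MW"]
matches = find_by_ending(sample, "MS")
print(matches)
```

To try it on the real data, the sample list could be replaced by the lines of sanhw1.txt, read with the ordinary `open()` call.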
@drdhaval2785

http://sanskrit-lexicon.github.io/CORRECTIONS/abnormending/abnorm.html is the output.

https://github.com/sanskrit-lexicon/CORRECTIONS/tree/master/abnormending is the code.

Execution

Run this shell file to regenerate the results.

Logic

  1. The last two letters of each headword in sanhw1.txt are stored as 'endings'.
  2. The number of headwords with each ending is counted, e.g. (2, u'hE') means that only 2 headwords end with 'hE'. See this for the full list.
  3. Only endings with fewer than 50 headwords (i.e. the less frequent ones) are kept.
  4. This list is sorted in ascending order of count (1, 2, 3, ... 49).
  5. sanhw1.txt is checked against three criteria: (1) the headword ends with one of the endings from the sorted list of step 4, (2) the headword occurs in only one dictionary, and (3) the headword is not listed in nochange.
    In regex terms: if re.search(end+':[^,]*$',datum) and datum not in noc.
  6. Headwords passing these criteria are stored in abnorm.txt.
  7. Links to the webpage and PDF are added by link.php, and the result is stored in abnorm.html.
  8. abnorm.html is put on github.io so that potential errors can be reviewed and corrections submitted.
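
Steps 1 to 6 above can be sketched roughly as follows. This is not the code from the abnormending directory, only an illustration of the logic as described. It assumes lines of the form `headword:DICT1,DICT2,...` and treats `nochange` as a set of whole lines. The function names `rare_endings` and `suspicious` and the sample data are hypothetical, and a small cutoff of 3 stands in for 50 so that the toy data shows the effect.

```python
import re
from collections import Counter


def rare_endings(headwords, cutoff=50, length=2):
    """Steps 1-4: count the last `length` letters of each headword and
    return (count, ending) pairs with count < cutoff, in ascending order."""
    counts = Counter(hw[-length:] for hw in headwords)
    return sorted((n, end) for end, n in counts.items() if n < cutoff)


def suspicious(lines, endings, nochange):
    """Steps 5-6: keep lines whose headword has a rare ending, that list
    only one dictionary (no comma after the colon), and are not in nochange."""
    kept = []
    for datum in lines:
        datum = datum.strip()
        if datum in nochange:
            continue
        for end in endings:
            # Same shape as the regex quoted in step 5; re.escape guards
            # against endings that contain regex metacharacters.
            if re.search(re.escape(end) + r":[^,]*$", datum):
                kept.append(datum)
                break
    return kept


# Made-up sample data, for illustration only.
lines = ["rAma:MW,PW", "kfzRa:MW", "deva:MW,PW", "hE:MW",
         "ahE:PWG", "siva:MW,AP", "nava:PW"]
headwords = [line.split(":", 1)[0] for line in lines]

rare = rare_endings(headwords, cutoff=3)
flagged = suspicious(lines, [end for _n, end in rare], nochange={"ahE:PWG"})
print(rare)
print(flagged)
```

In this sketch "rAma:MW,PW" is not flagged even though 'ma' is a rare ending, because the comma after the colon shows that the headword occurs in more than one dictionary.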

@gasyoun

gasyoun commented Dec 2, 2015

@drdhaval2785 Thank you. I can see dozens of mistakes at a glance. I will start documenting them. I hope @zaaf2 is not lost from PWK and PWG.
