Tabular Data - Read Column By Column #33

Open
tmgcassidy opened this issue Dec 3, 2020 · 1 comment

Comments

@tmgcassidy

Is there any way to leverage Tesseract for table extraction? Below is an example of a table I would need to extract Chinese Characters (Simplified) from.

image

@blackholeearth
Contributor

blackholeearth commented Feb 5, 2022

For tesseract.exe, you need to use the hocr output. It gives the x,y coordinates of the recognized text in HTML.

You have to enable wordLevelSegmentation.
(I don't know the exact command-line commands.)


For C#, see https://stackoverflow.com/questions/51282214/tesseract-ocr-text-position

You can specify word-level segmentation in C# code.

// How to use
void Main()
{
    // I'm writing this on an Android phone, so it is untested.
    // img is either a file path ...
    var img = "imgFilePath.jpg";
    // ... or a bitmap:
    // var img = new Bitmap("imgFilePath.jpg");

    var li1 = GetWordsFromImage(img);
    var tessWordLi = AssignColumnNo(li1);

    var allColumnNos = tessWordLi.Select(x => x.ColumnNo).Distinct().ToList();

    // Get the words of the 1st and 2nd columns, etc.
    var wordsThatAreOn_col1 = tessWordLi.Where(x => x.ColumnNo == 1).ToList();
    var wordsThatAreOn_col2 = tessWordLi.Where(x => x.ColumnNo == 2).ToList();
}

// Required class and functions
class TessWord
{
    public string Text;
    public Rectangle Rect;        // bounds of the word on the page
    public Rectangle RectYZeroed; // same bounds with Y set to 0
    public int ColumnNo = 0;      // 0 means no column assigned yet
}

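// Assumes the Tesseract .NET wrapper used in the Stack Overflow link and an
// existing TesseractEngine named Engine (e.g. loaded with chi_sim data).
List<TessWord> GetWordsFromImage(string imgPath)
{
    var tessWordLi = new List<TessWord>();
    var myLevel = Tesseract.PageIteratorLevel.Word;
    using (var pix = Tesseract.Pix.LoadFromFile(imgPath))
    using (var page = Engine.Process(pix))
    using (var iter = page.GetIterator())
    {
        iter.Begin();
        do
        {
            // The wrapper returns its own Rect type, so convert it to a Rectangle.
            if (iter.TryGetBoundingBox(myLevel, out var b))
            {
                var curText = iter.GetText(myLevel);
                var curRect = new Rectangle(b.X1, b.Y1, b.Width, b.Height);

                // Add the recognized word to the list.
                tessWordLi.Add(new TessWord
                {
                    Text = curText,
                    Rect = curRect,
                    // Set the rectangle's Y coordinate to zero.
                    RectYZeroed = new Rectangle(curRect.X, 0, curRect.Width, curRect.Height),
                });
            }
        } while (iter.Next(myLevel));
    }
    return tessWordLi;
}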

List<TessWord> AssignColumnNo(List<TessWord> tessWordLi)
{
    foreach (var word1 in tessWordLi)
    foreach (var word2 in tessWordLi)
    {
        if (word1.RectYZeroed.IntersectsWith(word2.RectYZeroed))
        {
            // The two words are in the same column.
            if (word1.ColumnNo == 0)
            {
                // The word has no column number yet: reuse word2's, or give it a new one.
                word1.ColumnNo = word2.ColumnNo != 0
                    ? word2.ColumnNo
                    : tessWordLi.Select(x => x.ColumnNo).Max() + 1;
            }
            word2.ColumnNo = word1.ColumnNo;
        }
    }
    // Note: column numbers follow word order, not left-to-right position.
    return tessWordLi;
}
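
The column grouping can also be tried on its own, without Tesseract. Below is a minimal sketch that uses a different technique from the pairwise comparison above: it sorts the words by their left edge and merges overlapping horizontal ranges, so columns come out numbered left to right. `ColumnGrouper` and `AssignColumns` are hypothetical names for illustration only, and the (Text, Left, Right) tuples stand in for the word bounds you would get from the iterator.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ColumnGrouper
{
    // Hypothetical helper, not part of any Tesseract API.
    // Input: each word with the left and right X coordinate of its bounding box.
    // Output: each word with a 1-based column number, numbered left to right.
    public static List<(string Text, int Column)> AssignColumns(
        IEnumerable<(string Text, int Left, int Right)> words)
    {
        var result = new List<(string Text, int Column)>();
        int column = 0;        // 0 means no column has been opened yet
        int columnRight = 0;   // right edge of the column currently being built

        foreach (var w in words.OrderBy(w => w.Left))
        {
            if (column == 0 || w.Left >= columnRight)
            {
                // No horizontal overlap with the current column: start a new one.
                column++;
                columnRight = w.Right;
            }
            else
            {
                // Overlaps the current column: widen it if this word reaches further right.
                columnRight = Math.Max(columnRight, w.Right);
            }
            result.Add((w.Text, column));
        }
        return result;
    }
}
```

For example, calling `ColumnGrouper.AssignColumns` with the words ("b", 120, 180), ("a", 10, 60), ("c", 15, 70) and ("d", 125, 170) puts "a" and "c" in column 1 and "b" and "d" in column 2. This assumes the table columns do not overlap horizontally; a wide header that spans two columns would merge them into one.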
