Converting tables to “running text” #44

Closed
omri-suissa-clearmash opened this issue Mar 21, 2022 · 4 comments

Comments

@omri-suissa-clearmash

I have a table, for example:

<div>
<table cellspacing="0" style="-aw-border-insideh:0.5pt single #ffffff; -aw-border-insidev:0.5pt single #ffffff; border-collapse:collapse">
	<tbody>
		<tr>
			<td style="background-color:#4472c4; vertical-align:top; width:79.35pt">
			<p><strong>Product</strong></p>
			</td>
			<td style="background-color:#4472c4; vertical-align:top; width:79.35pt">
			<p><strong>Size</strong></p>
			</td>
			<td style="background-color:#4472c4; vertical-align:top; width:79.35pt">
			<p><strong>Price</strong></p>
			</td>
			<td style="background-color:#4472c4; vertical-align:top; width:79.35pt">
			<p><strong>Location</strong></p>
			</td>
			<td style="background-color:#4472c4; vertical-align:top; width:79.4pt">
			<p><strong>Comment</strong></p>
			</td>
		</tr>
		<tr>
			<td style="background-color:#4472c4; vertical-align:top; width:79.35pt">
			<p><strong>A</strong></p>
			</td>
			<td style="background-color:#b4c6e7; border-color:#ffffff; vertical-align:top; width:79.35pt">
			<p>50X20cm</p>
			</td>
			<td style="background-color:#b4c6e7; border-color:#ffffff; vertical-align:top; width:79.35pt">
			<p>55$</p>
			</td>
			<td style="background-color:#b4c6e7; border-color:#ffffff; vertical-align:top; width:79.35pt">
			<p>IL</p>
			</td>
			<td style="background-color:#b4c6e7; border-color:#ffffff; vertical-align:top; width:79.4pt">
			<p>Text text&hellip;</p>
			</td>
		</tr>
		<tr>
			<td style="background-color:#4472c4; vertical-align:top; width:79.35pt">
			<p><strong>B</strong></p>
			</td>
			<td style="background-color:#d9e2f3; border-color:#ffffff; vertical-align:top; width:79.35pt">
			<p>50X20cm</p>
			</td>
			<td style="background-color:#d9e2f3; border-color:#ffffff; vertical-align:top; width:79.35pt">
			<p>55$</p>
			</td>
			<td style="background-color:#d9e2f3; border-color:#ffffff; vertical-align:top; width:79.35pt">
			<p>BK</p>
			</td>
			<td style="background-color:#d9e2f3; border-color:#ffffff; vertical-align:top; width:79.4pt">
			<p>Text text&hellip;</p>
			</td>
		</tr>
		<tr>
			<td style="background-color:#4472c4; vertical-align:top; width:79.35pt">
			<p><strong>C</strong></p>
			</td>
			<td style="background-color:#b4c6e7; border-color:#ffffff; vertical-align:top; width:79.35pt">
			<p>50X20cm</p>
			</td>
			<td style="background-color:#b4c6e7; border-color:#ffffff; vertical-align:top; width:79.35pt">
			<p>55$</p>
			</td>
			<td style="background-color:#b4c6e7; border-color:#ffffff; vertical-align:top; width:79.35pt">
			<p>LM</p>
			</td>
			<td style="background-color:#b4c6e7; border-color:#ffffff; vertical-align:top; width:79.4pt">
			<p>Text text&hellip;</p>
			</td>
		</tr>
		<tr>
			<td style="background-color:#4472c4; vertical-align:top; width:79.35pt">
			<p><strong>D</strong></p>
			</td>
			<td style="background-color:#d9e2f3; border-color:#ffffff; vertical-align:top; width:79.35pt">
			<p>50X20cm</p>
			</td>
			<td style="background-color:#d9e2f3; border-color:#ffffff; vertical-align:top; width:79.35pt">
			<p>55$</p>
			</td>
			<td style="background-color:#d9e2f3; border-color:#ffffff; vertical-align:top; width:79.35pt">
			<p>LM</p>
			</td>
			<td style="background-color:#d9e2f3; border-color:#ffffff; vertical-align:top; width:79.4pt">
			<p>Text text&hellip;</p>
			</td>
		</tr>
		<tr>
			<td style="background-color:#4472c4; vertical-align:top; width:79.35pt">
			<p><strong>E</strong></p>
			</td>
			<td style="background-color:#b4c6e7; border-color:#ffffff; vertical-align:top; width:79.35pt">
			<p>50X20cm</p>
			</td>
			<td style="background-color:#b4c6e7; border-color:#ffffff; vertical-align:top; width:79.35pt">
			<p>55$</p>
			</td>
			<td style="background-color:#b4c6e7; border-color:#ffffff; vertical-align:top; width:79.35pt">
			<p>PP</p>
			</td>
			<td style="background-color:#b4c6e7; border-color:#ffffff; vertical-align:top; width:79.4pt">
			<p>Text text&hellip;</p>
			</td>
		</tr>
		<tr>
			<td style="background-color:#4472c4; vertical-align:top; width:79.35pt">
			<p><strong>f</strong></p>
			</td>
			<td style="background-color:#d9e2f3; border-color:#ffffff; vertical-align:top; width:79.35pt">
			<p>50X20cm</p>
			</td>
			<td style="background-color:#d9e2f3; border-color:#ffffff; vertical-align:top; width:79.35pt">
			<p>55$</p>
			</td>
			<td style="background-color:#d9e2f3; border-color:#ffffff; vertical-align:top; width:79.35pt">
			<p>RXS</p>
			</td>
			<td style="background-color:#d9e2f3; border-color:#ffffff; vertical-align:top; width:79.4pt">
			<p>Text text&hellip;</p>
			</td>
		</tr>
	</tbody>
</table>

<p>&nbsp;</p>
</div>

inscriptis transforms this into:

    Product  Size     Price  Location  Comment   
                                                 
    A        50X20cm  55$    IL        Text text…
                                                 
    B        50X20cm  55$    BK        Text text…
                                                 
    C        50X20cm  55$    LM        Text text…
                                                 
    D        50X20cm  55$    LM        Text text…
                                                 
    E        50X20cm  55$    PP        Text text…
                                                 
    f        50X20cm  55$    RXS       Text text…

Which is great. However, I want to convert it into running text:

Product: A Size: 50X20cm Price: 55$ Location: IL Comment: Text text…                                      
Product: B Size: 50X20cm Price: 55$ Location: BK Comment: Text text…           
Product: C Size: 50X20cm Price: 55$ Location: LM Comment: Text text…           
Product: D Size: 50X20cm Price: 55$ Location: LM Comment: Text text…           
Product: E Size: 50X20cm Price: 55$ Location: PP Comment: Text text…           
Product: f Size: 50X20cm Price: 55$ Location: RXS Comment: Text text…

That is, for each row, the column name should be added before each value.
Can this be done using inscriptis?

@AlbertWeichselbraun
Contributor

this can easily be done:

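from inscriptis import get_text

# Read the HTML document that contains the table.
with open("/tmp/t.html") as f:
    content = f.read()
text = get_text(content)

# Drop the blank lines inscriptis emits between table rows.
rows = [line.strip() for line in text.split('\n') if line.strip()]

# The first non-empty line is the header row, so skip it.
for line in rows[1:]:
    # maxsplit=4 keeps the spaces inside the last column (Comment) intact.
    product, size, price, location, comment = line.split(maxsplit=4)
    print(f'Product: {product} Size: {size} Price: {price} Location: {location} Comment: {comment}')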
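
The snippet above hard-codes the five column names. A slightly more general variant, shown here only as a minimal sketch, reads the names from the first text row instead. It assumes the default inscriptis layout seen in the output above, where cells are separated by at least two spaces, and that no cell is empty or contains two consecutive spaces. The function name `table_text_to_running_text` is made up for illustration and is not part of inscriptis.

```python
import re


def table_text_to_running_text(text):
    """Turn column-aligned table text into 'Header: value' lines."""
    # Drop the blank lines between table rows.
    rows = [line.strip() for line in text.splitlines() if line.strip()]
    # Assumption: cells are separated by two or more whitespace characters.
    cells = [re.split(r"\s{2,}", row) for row in rows]
    header, body = cells[0], cells[1:]
    return [
        " ".join(f"{name}: {value}" for name, value in zip(header, row))
        for row in body
    ]


sample = (
    "Product  Size     Price  Location  Comment   \n"
    "                                             \n"
    "A        50X20cm  55$    IL        Text text…\n"
)
for line in table_text_to_running_text(sample):
    print(line)
```

Because the split relies on whitespace runs, an empty cell would shift the remaining values one column to the left, so this only suits simple, fully populated tables.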

@omri-suissa-clearmash
Author

@AlbertWeichselbraun thank you. However, this was just an example. What I am looking for is a generic way to set a rule or callback on every table found by inscriptis. Is that possible?

@omri-suissa-clearmash
Author

For example, consider the following table:

<table border="1" align="center" cellpadding="10px">
  <thead>
    <tr>
      <th rowspan="3">Type</th>
      <th rowspan="3">Day</th>
      <th colspan="3">Seminar</th>
    </tr>
    <tr>
      <th colspan="2">Schedule</th>
      <th rowspan="2">Topic</th>
    </tr>
    <tr>
      <th>Begin</th>
      <th>End</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th rowspan="5" scope="row">Class</th>
    </tr>
    <tr>
      <td rowspan="2">Sunday</td>
      <td rowspan="2">9:00 a.m</td>
      <td rowspan="2">6:00 p.m</td>
      <td>Introduction to XML</td>
    </tr>
    <tr>
      <td>Validity: YYY and Relax VV</td>
    </tr>
    <tr>
      <td rowspan="2">Monday</td>
      <td rowspan="2">8:00 a.m</td>
      <td rowspan="2">5:00 p.m</td>
      <td>Introduction to TSQL</td>
    </tr>
    <tr>
      <td>Validity: DTD and Relax NG</td>
    </tr>
    <tr>
      <th rowspan="4" scope="row">Skill</th>
      <td rowspan="4">Tuesday</td>
      <td>8:00 a.m</td>
      <td>11:00 a.m</td>
      <td rowspan="2">XPath</td>
    </tr>
    <tr>
      <td rowspan="2">12:00 a.m</td>
      <td rowspan="2">2:00 p.m</td>
    </tr>
    <tr>
      <td rowspan="2">XSL transformation</td>
    </tr>
    <tr>
      <td>3:00 p.m</td>
      <td>5:00 p.m</td>
    </tr>
    <tr>
      <th rowspan="1" scope="row">Test</th>
      <td>Wednesday</td>
      <td>7:00 a.m</td>
      <td>10:00 p.m</td>
      <td>XLS Formatting Objects</td>
    </tr>

  </tbody>
  </table>

The running text of the first row should be:
Type: Class, Day: Sunday, Seminar Schedule Begin: 9:00 a.m, Seminar Schedule End: 6:00 p.m, Seminar Topic: Introduction to XML, Seminar Topic: Validity: YYY and Relax VV
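
To make the requested behaviour concrete, here is a rough sketch of how multi-level headers with `rowspan` and `colspan` could be flattened into such labels. It is plain Python and not an inscriptis feature. It assumes the cells have already been extracted from the HTML as `(text, rowspan, colspan)` tuples, and all function names are hypothetical. It also emits one line per grid row, so the two Sunday topics end up on separate lines rather than merged into one.

```python
def expand_spans(rows):
    """Expand rowspan/colspan into a rectangular grid of cell texts.

    rows: list of rows, each a list of (text, rowspan, colspan) tuples.
    """
    filled = {}  # (row, col) -> text
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already claimed by a span from an earlier row.
            while (r, c) in filled:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    filled[(r + dr, c + dc)] = text
            c += colspan
    n_cols = max(col for _, col in filled) + 1
    # Spans that run past the last given row are clipped here.
    return [[filled.get((r, c), "") for c in range(n_cols)]
            for r in range(len(rows))]


def header_labels(grid, n_header_rows):
    """Join the header cells above each column, dropping repeated texts."""
    labels = []
    for c in range(len(grid[0])):
        parts = []
        for r in range(n_header_rows):
            cell = grid[r][c]
            if cell and (not parts or parts[-1] != cell):
                parts.append(cell)
        labels.append(" ".join(parts))
    return labels


def running_text(rows, n_header_rows):
    """Return one 'Label: value, ...' line per body row of the grid."""
    grid = expand_spans(rows)
    labels = header_labels(grid, n_header_rows)
    return [
        ", ".join(f"{label}: {cell}" for label, cell in zip(labels, row) if cell)
        for row in grid[n_header_rows:]
    ]


# The three header rows and the first three body rows of the table above.
rows = [
    [("Type", 3, 1), ("Day", 3, 1), ("Seminar", 1, 3)],
    [("Schedule", 1, 2), ("Topic", 2, 1)],
    [("Begin", 1, 1), ("End", 1, 1)],
    [("Class", 5, 1)],
    [("Sunday", 2, 1), ("9:00 a.m", 2, 1), ("6:00 p.m", 2, 1),
     ("Introduction to XML", 1, 1)],
    [("Validity: YYY and Relax VV", 1, 1)],
]
lines = running_text(rows, n_header_rows=3)
for line in lines:
    print(line)
```

The row that contains only the `Class` cell produces a line with a single pair (`Type: Class`); a real implementation would probably filter such rows or merge them with the following ones.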

@AlbertWeichselbraun
Contributor

my recommendation would be to use

  • a tab character (\t) as custom table cell separator and
  • inscriptis' annotation postprocessors to obtain the positions of tables within the text

you could then automatically extract tables, and split them into columns based on the tabulator.
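
The post-processing half of this recommendation could look roughly like the sketch below. It assumes the annotated output is a dict with a `text` string and a `label` list of `[start, end, name]` triples, and that table cells were joined with a tab. The function name `tables_to_running_text` and the sample data are hypothetical. The inscriptis calls in the comment (`get_annotated_text`, `ParserConfig` with `annotation_rules` and `table_cell_separator`) are an assumption about the installed version and should be checked against its documentation.

```python
# Assumed way to obtain `annotated` (not executed here):
#   from inscriptis import get_annotated_text
#   from inscriptis.model.config import ParserConfig
#   config = ParserConfig(annotation_rules={"table": ["table"]},
#                         table_cell_separator="\t")
#   annotated = get_annotated_text(html, config)


def tables_to_running_text(annotated, label="table", separator="\t"):
    """Return, for every annotated table, a list of 'Header: value' lines."""
    results = []
    for start, end, name in annotated["label"]:
        if name != label:
            continue
        table_text = annotated["text"][start:end]
        # Split each non-blank line into cells and strip the column padding.
        rows = [
            [cell.strip() for cell in line.split(separator)]
            for line in table_text.splitlines()
            if line.strip()
        ]
        if not rows:
            continue
        header, body = rows[0], rows[1:]
        results.append([
            " ".join(f"{h}: {v}" for h, v in zip(header, row) if h)
            for row in body
        ])
    return results


# Hand-made stand-in for the annotated output, for illustration only.
table = "Product\tSize   \tComment   \nA      \t50X20cm\tText text…\n"
text = "Intro\n" + table + "Outro"
annotated = {"text": text, "label": [[6, 6 + len(table), "table"]]}
print(tables_to_running_text(annotated))
```

Splitting on the tab rather than on whitespace keeps cells with internal spaces, and empty cells, in the right column.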
